Copyright 2014, Tianlan Wei

Measuring Mathematics Interest and Affect: An Item Response Theory Evaluation of

the Self Description Questionnaire I (SDQI)

by

Tianlan Wei, M.Ed.

A Dissertation

In

Educational Psychology

Submitted to the Graduate Faculty of Texas Tech University in

Partial Fulfillment of the Requirements for

the Degree of

DOCTOR OF PHILOSOPHY

Approved

Lucy Barnard-Brak, Ph.D.
Chair of Committee

William Lan, Ph.D.

Tara Stevens, Ed.D.

Mark Sheridan
Dean of the Graduate School

August, 2014

Texas Tech University, Tianlan Wei, August 2014

ACKNOWLEDGMENTS

Many people have influenced my work on this dissertation, and I would like to

recognize each of them with my sincere appreciation. First, my deepest gratitude goes

to Lucy Barnard-Brak, my committee chair, and an incredible source of support and

inspiration during my graduate study at Texas Tech University. It was through

working with her that I was exposed to many advanced research methods and

quantitative skills beyond coursework, including the item response theory techniques,

which this dissertation is based on. Lucy means so much to me in my scholarly life: a

dedicated mentor, a most intelligent team leader, a caring elder sister, and a role model

for me always to emulate.

Next, I would like to thank William Lan and Doug Simpson for bringing me

back to academia when no one else saw my potential as a researcher. It is no

exaggeration when I say that their assistance was life-changing to me. As my

committee member, Dr. Lan’s comments on my dissertation work have always been

encouraging and constructive. He impressed upon me how kind and generous

academics can be.

Although my work has benefited from many professors and colleagues, I must

recognize the critical role of Tara Stevens, also a committee member of mine, for

establishing the foundation for my research agenda. Her earlier work on mathematics

education, particularly the gender gap and associated affective-domain factors, has had

a great influence on this dissertation as well as my other studies.

My parents are of the birth cohort that was deprived of much education due to

the political movements back then in China, and I feel grateful that they would never

fail to value my education. I would also like to thank my uncle for encouraging me to

pursue a terminal degree for years when I had so little confidence in myself.

Lastly, I would like to thank Catrina Wang for breaking years of silence to

show up at this critical stage of my dissertation work, though briefly. Fortunately and

unfortunately, some human feelings are beyond expression and any psychometric

calibration—they will forever remain unspeakable and immeasurable.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

ABSTRACT

LIST OF TABLES

LIST OF FIGURES

I. INTRODUCTION
    Context of the Study
    Statement of the Problem
        Theoretical Issues
        Conceptual Issues
        Measurement Issues
    Purpose of the Study
    Research Questions
    Significance of the Study
    Limitations of the Study

II. REVIEW OF LITERATURE
    The Concept of Interest
        Personal Interest and Situational Interest
        Interest, Affect, and Motivation
    Interest and Affect in Theories of Achievement Motivation
        Expectancy-Value Theory of Motivation
        Attribution Theory of Motivation
        Goal-Orientation Theories
        Intrinsic and Extrinsic Motivation
        The Roles of Interest and Affect in Achievement Motivation
    Gender, Ethnic, and Age Differences in Mathematics Interest and Affect
        Gender Differences
        Ethnic Differences
        Age Differences
    Summary of the Literature
    Research Questions and Hypotheses

III. METHOD
    Description of the Data
        Sampling Frame
        Sample
    Instrumentation
        Face Validity
    Item Response Theory
        Assumptions
        Graded Response Model
    Differential Item Functioning
        Lord's Wald Test
    Analysis of Data
        Phase 1: Factor Structure, Reliability, Validity, and Group Differences
        Phase 2: Item Response Theory (IRT) Analysis
        Phase 3: Differential Item Functioning (DIF) Analysis

IV. RESULTS
    Phase 1: Factor Structure, Reliability, Validity, and Group Differences
        Factor Structure
        Internal Consistency Reliability
        Criterion-Related Validity
        Group Differences
    Phase 2: Item Response Theory (IRT) Analysis
    Phase 3: Differential Item Functioning (DIF) Analysis
        DIF across Gender
        DIF across Ethnicity
        Item Parameter Drift

V. DISCUSSION
    The Psychometric Properties of the SDQI Mathematics Subscale
        The Classical Test Theory Perspective
        The Item Response Theory Perspective
        Summary
    Measurement Bias of the SDQI Mathematics Subscale across Gender
    Measurement Bias of the SDQI Mathematics Subscale across Ethnicity
    Item Parameter Drift of the SDQI Mathematics Subscale
    Conclusion

BIBLIOGRAPHY


ABSTRACT

In response to today’s high-tech global economy, there has been an increasing

need for professionals in the Science, Technology, Engineering, and Mathematics

(STEM) fields. To enhance students’ STEM performance and career aspirations,

researchers have suggested capturing their mathematics interest as early as elementary

school. Research on academic interest, however, has suffered from various conceptual,

theoretical, and measurement issues. In the literature, there are mixed findings

regarding gender, ethnic, and age differences in mathematics interest, which have

prompted a reexamination of the existing measures of interest. The purpose of this

dissertation study was to evaluate the psychometric properties of the Self Description

Questionnaire I (SDQI) with regard to children’s mathematics interest based on the

item response theory (IRT) framework.

The data for this study were drawn from the Early Childhood Longitudinal

Study, Kindergarten Class of 1998-99 (ECLS-K) with a sample of 14,631 children.

These children were assessed in third and fifth grade using the ECLS-K adapted

SDQI mathematics subscale, which consists of eight items with ordered response

options ranging from 1 to 4. The IRT-based evaluation suggests that the SDQI items

have sufficient discrimination, but insufficient item thresholds in assessing third- and

fifth-graders’ mathematics interest. Specifically, the IRT parameter estimates indicate

that the 4-point scale of the SDQI items does not function much better than a dichotomous

scale, particularly among third-graders. Thus, it is recommended that the current

SDQI be reconsidered for its age appropriateness. In addition, affective items, which

involve enjoyment, liking, and positive emotions, are found to be better indicators of

interest than cognitive items that involve perceived competence, ability beliefs, or

self-efficacy.
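
The claim that a 4-point item can behave almost like a dichotomous one can be illustrated with a minimal sketch of the graded response model. The sketch assumes the standard logistic form of the model without an extra scaling constant. The first item uses the hypothetical parameters cited for Figures 3.1 and 3.2 (a = 1.7, b1 = -1.5, b2 = 0, b3 = 1.5); the second item, with closely spaced thresholds, is invented purely for illustration and is not an estimate from this study. The function names are hypothetical.

```python
import math


def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under a logistic graded response model."""
    # Boundary curves P*(X >= k): 1 for the lowest category, 0 past the highest.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum.append(0.0)
    # Each category probability is the difference of adjacent boundary curves.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def modal_categories(a, thresholds, grid):
    """Set of 1-based response options that are the most likely one somewhere on the grid."""
    modal = set()
    for theta in grid:
        probs = grm_category_probs(theta, a, thresholds)
        modal.add(probs.index(max(probs)) + 1)
    return modal


# Trait levels from -3 to 3 in steps of 0.25.
grid = [x / 4 for x in range(-12, 13)]

# Hypothetical item with well-separated thresholds (parameters from Figure 3.1).
print(modal_categories(1.7, [-1.5, 0.0, 1.5], grid))

# Invented item with closely spaced thresholds (an assumption, not an estimate).
print(modal_categories(1.7, [-0.2, 0.0, 0.2], grid))
```

With well-separated thresholds, each of the four options is the most likely response over some range of the trait. With the closely spaced thresholds, only the lowest and highest options are ever the most likely, so the item effectively sorts respondents into two groups.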

In terms of measurement bias across gender, ethnic, and age groups, item-level

bias emerged in most of the items as assessed by the IRT-based Wald Test of

differential item functioning. In general, it was found that boys and White children

tend to endorse higher-order options of cognitive, perceived competence items to

represent their mathematics interest, while girls and ethnic minority children tend to


endorse higher-order options of affective items. This affective-cognitive distinction of

measurement bias indicates that children may interpret and respond to self-report

interest measures differently as a function of their demographic background, resulting

in artificial item-level group differences as detected by the SDQI, although the scale

score can still be trusted in such group comparisons. Finally, results of the item

parameter drift analysis indicate that children become more sensitive to the response

anchors as they grow older, which reaffirms that the current SDQI items with the

4-point scale perform poorly among young children. Taken together, findings of this

dissertation study provide insights into the conceptualization and operationalization of

mathematics interest and affect, as well as practical implications for the future use and

development of self-report measures in this field.

Keywords: mathematics interest, Self Description Questionnaire, item response theory,

measurement bias, gender difference, ethnic difference, item parameter drift


LIST OF TABLES

3.1 The ECLS-K Adapted SDQI Mathematics Subscale
4.1 Descriptive Statistics of the ECLS-K Variables Included in the Study (N = 14,631)
4.2 Bivariate Correlations among the SDQI Items in Third Grade
4.3 Bivariate Correlations among the SDQI Items in Fifth Grade
4.4 Internal Consistency Reliabilities and Factor Loadings by Item
4.5 IRT Parameter Estimates (a, b1, b2, b3) of Third- and Fifth-Grade Full Samples
4.6 Summary of the DIF Results by Gender, Ethnicity, and Age
4.7 Parameter Estimates of Multigroup (Male vs. Female) IRT in Third Grade
4.8 Parameter Estimates of Multigroup (Male vs. Female) IRT in Fifth Grade
4.9 Parameter Estimates of Multigroup (White vs. African American, White vs. Hispanic, and White vs. Asian American) IRT in Third Grade
4.10 Parameter Estimates of Multigroup (White vs. African American, White vs. Hispanic, and White vs. Asian American) IRT in Fifth Grade
4.11 Parameter Estimates of Item Parameter Drift Analysis


LIST OF FIGURES

2.1 Three approaches to interest research
2.2 Model of triadic reciprocal causations
2.3 A social cognitive expectancy-value model of achievement motivation
2.4 Overview of the general attributional model
3.1 Option characteristic curves (OCCs) for a hypothetical item with four response options (a = 1.7, b1 = -1.5, b2 = 0, b3 = 1.5)
3.2 Item information function (IIF) curve for a hypothetical item with four response options (a = 1.7, b1 = -1.5, b2 = 0, b3 = 1.5)
4.1 Group differences as assessed by the SDQI mathematics scale score
4.2 Option characteristic curves of the items in third grade
4.3 Option characteristic curves of the items in fifth grade
4.4 Item information function curves by item across grade levels
4.5 Test information function curves by scale across grade levels
4.6 Option characteristic curves of multigroup (male vs. female) IRT in third grade
4.7 Option characteristic curves of multigroup (male vs. female) IRT in fifth grade
4.8 Test characteristic function curves of multigroup (male vs. female) IRT
4.9 Option characteristic curves of multigroup (White vs. African American) IRT in third grade
4.10 Option characteristic curves of multigroup (White vs. Hispanic) IRT in third grade
4.11 Option characteristic curves of multigroup (White vs. Asian American) IRT in third grade
4.12 Option characteristic curves of multigroup (White vs. African American) IRT in fifth grade
4.13 Option characteristic curves of multigroup (White vs. Hispanic) IRT in fifth grade
4.14 Option characteristic curves of multigroup (White vs. Asian American) IRT in fifth grade
4.15 Test characteristic function curves of multigroup (White vs. African American, White vs. Hispanic, and White vs. Asian American) IRT
4.16 Option characteristic curves of item parameter drift (third- vs. fifth-grade)
4.17 Test characteristic function curves of item parameter drift


CHAPTER I

INTRODUCTION

Context of the Study

Science, Technology, Engineering, and Mathematics (STEM) represent

intellectual and cultural achievements that constitute fundamental aspects of

contemporary society. In response to today’s high-tech global economy, there has

been an increasing need for professionals in the STEM fields. The 2008-2018

occupational employment projections (Lacey & Wright, 2009) suggested that, along

with changes in the U.S. demographics and the increasingly competitive business

environment, computer and mathematical science occupations would see employment

growth of over 20% by 2018. Moreover, STEM has also been widely acknowledged

as a powerful engine of continued scientific leadership and prosperity of the United

States (National Research Council [NRC], 2011). Therefore, it concerns both policy

makers and employers that K-12 education in the U.S. does not seem to prepare

students adequately for the pursuit of STEM degree programs and professions. In fact,

it was reported that international students accounted for over one-third of the students

in U.S. science and engineering schools (National Academy of Sciences, National

Academy of Engineering, & Institute of Medicine, 2007), and the STEM workforce

had largely depended on these foreign-born mathematicians, scientists, and engineers

(Sanders, 2004).

Two major issues emerged as pertinent to K-12 STEM education in the U.S.

First, it appears that U.S. students’ STEM performance, particularly that of

mathematics, is less than satisfactory considering current and future economic

demands. For example, the data collected by the National Assessment of Educational

Progress (NAEP) over the 1996-2009 period indicated that about 75% of U.S. students

were not proficient in mathematics when they completed eighth grade (Schmidt, 2011).

Furthermore, whereas the NAEP data showed that the percentage of students who

were proficient in mathematics increased from 19% to 33% for fourth graders, and


from 20% to 26% for eighth graders, the increase was merely 2 percentage points for 12th graders (21%

to 23%), indicating a decline in students’ mathematics performance as they advanced

to high school. Recent reports regarding international assessments also indicated that

U.S. students had routinely fallen behind higher performing educational systems such

as Singapore and Hong Kong in science and mathematics performance. For example,

only 13% of U.S. fourth graders met the Trends in International Mathematics and

Science Study (TIMSS) international benchmark of advanced in mathematics, as

compared with 43% for Singapore and 39% for South Korea (Provasnik et al., 2012).

The statistics were even lower for U.S. eighth graders; only 7% of students met the

advanced benchmark as compared with 49% for Chinese Taipei and 48% for

Singapore. Other education systems such as Russia (14%), Australia (9%), and

England (8%) also outperformed the U.S. The 2009 Program for International Student

Assessment (PISA) data were no more impressive in terms of U.S. students’

mathematics literacy: the average score of U.S. 15-year-old students (487) was lower

than the Organization for Economic Cooperation and Development’s (OECD) average

score of 496, and was surpassed by those of 17 other OECD countries, including 541 in Finland, 527

in Canada, and 514 in Australia (Fleischman, Hopstock, Pelczar, & Shelley, 2010).

Together, these data indicate that U.S. students’ mathematics and science performance

declines as they advance to higher grades, either as measured by NAEP standards, or

within the context of international comparisons.

The second issue of STEM education in the U.S. relates to students’ career

aspirations. The diminished performance may be a function of how students’ career

aspirations in STEM fields are shaped. The decrease in the supply of college-educated

STEM graduates had caused concerns in most western countries (van Langen &

Dekkers, 2005). The proportion of postsecondary graduates in STEM fields was only

12% in the U.S. (OECD, 2003), and as documented by the National Science Board

(NSB; 2010), the overall percentage of STEM degrees awarded has further decreased

since. The NSB (2010) statistics showed that, although nearly one-fourth of all U.S.

undergraduates began their studies in STEM majors, only one-half of them would


graduate with a STEM degree. Researchers (e.g., Hossain & Robinson, 2012;

Korpershoek, Kuyper, Bosker, & van der Werf, 2013) argued that many students

might fail to realize their potential, and leave the STEM track before or in college due

to a lack of proper motivation.

With regard to motivation, women’s academic performance and career

aspirations in STEM fields warrant particular attention. Although the gender gap in

mathematics achievement appears to have narrowed slowly over the past two decades

(Else-Quest, Hyde, & Linn, 2010), women have continued to be underrepresented in the

STEM fields and leadership positions in the U.S. For example, recent data revealed

that women accounted for only 26% of the college-educated workforce in science and

engineering (NSB, 2010), and only 30% of STEM deans and department heads

(National Science Foundation [NSF], National Center for Science and Engineering

Statistics [NCSES], 2010). The NSF (2013) report further indicated that women’s

shares of postsecondary degrees in physical sciences and mathematics remained well

below those of men and that, despite the rise of full-time, full professorships held by

women over the 1993-2010 period, women represented less than one-fourth of all full-

time, full professors in the STEM fields.

Based on the above evidence, the NRC (2011) set the primary goal for

effective K-12 STEM education as expanding the number of students who pursue

advanced STEM degrees and careers and broadening the participation of women and

minorities in these fields. To achieve this goal, educational research has sought to

identify factors that serve to maintain high levels of motivation in STEM subjects

among students. One of the major individual-level motivational factors is academic

interest (Schunk, Pintrich, & Meece, 2008). To enhance students’ performance and

intention to pursue careers in the STEM fields, researchers have suggested capturing a

student’s interest as early as elementary school (DeJarnette, 2012; Russell, Hancock,

& McCullough, 2007). This suggestion was based on the well-documented positive

associations between interest and subsequent learning (e.g., Hidi, 2000; Hidi &

Harackiewicz, 2000; Schiefele, 1991; Schiefele, Krapp, & Winteler, 1992). This


dissertation was, therefore, intended to address issues regarding the conceptualization

and measurement of interest as related to affective experiences within the domain of

mathematics education.

Statement of the Problem

Although often conducted within the framework of achievement motivation,

research on academic interest and affect has yet to provide a widely acknowledged

theory to account for how interest influences motivation and achievement. Interest,

along with associated affective experiences, is evident in the theoretical models of

influential motivational theories such as the expectancy-value theory (Eccles et al.,

1983) and the attribution theory (Weiner, 1986, 1992), but is not accorded as much importance

as constructs such as self-efficacy (Bandura, 1997), expectancy and task values

(Eccles et al., 1983), or locus of control (Weiner, 1986). For example, as compared

with self-efficacy, which had received much attention since the 1980s for its predictive

power over other motivational factors (Klassen & Usher, 2010), research on academic

interest appeared to lack a guiding theory, and had not advanced much over the past

two decades (Schunk et al., 2008). The relative paucity of research in this field may

be examined from several perspectives involving conceptual, theoretical, and specific

measurement issues. These perspectives, however, are not mutually exclusive.

Theoretical Issues

Interest and affect were barely studied in behavioral theories of motivation,

which had dominated educational research until the 1960s. In the view of behavioral

theorists (e.g., Skinner, 1953), motivation is evidenced by a change in the rate,

frequency of occurrence, or form of behaviors in response to environmental stimuli.

In other words, students are mostly motivated by environmental events, and the focus

of behavioral research is to examine the connection between hypothesized

environmental stimuli and observable changes in behaviors. As such, behavioral

theorists contended that research on motivation does not need to include thoughts and

feelings (Schunk et al., 2008). In behaviorism, stimulus and response are directly

observable, but interest and affect are usually not. This lack of direct observation


points to the dilemma of measuring unobservable constructs regarding people’s

thoughts and feelings as part of scientific research, where such measures clearly carry more

sources of error (Weisberg, 2005).

Along with the emergence of social cognitive theory (Bandura, 1977, 1986),

cognitive theories of motivation largely replaced behavioral theories in guiding

research. Essentially, the social cognitive perspective of motivation added a cognitive

component to learning and motivation, and is thus more concerned with the learner’s

cognitive process. Bandura (1977) analyzed human learning and self-regulation in

terms of triadic reciprocal causations, which involve three types of determinants of

learning: personal determinants, behavioral determinants, and environmental

determinants. In other words, the cognitive perspective postulates that, in addition to

merely responding to environmental stimuli, individuals also actively process

information acquired from the environment, resulting in varying outcomes. The

inclusion of a cognitive component of motivation helps explain a large proportion of

individual variance unaccounted for by behaviorist studies.

Nonetheless, the role of interest and affect remains unclear in such cognitive

processes. The cognitive theories of motivation generally assume that people are

naïve scientists who try to understand their environments and causal determinants of

their past behaviors (Schunk et al., 2008), but this assumption may sometimes be

violated because humans do not always rely on their rationality in making judgments

and decisions. In fact, the commonly held view that humans base their decisional

processes entirely on rationality has been increasingly challenged in social sciences

since the late 1970s (Denes-Raj & Epstein, 1994). Some theorists argued that people

process information by two parallel processing systems: a rational system, which

functions based on established rules of logic and evidence, and an experiential system,

which functions based on the intuitive experience of affect from more concrete

exemplars and schemas (Epstein, 1990). From this perspective, most cognitive

theories of motivation appear to be unbalanced in that they place greater emphasis on

the rational system, and may underestimate and underplay the roles of learners’


affective experiences. Hence, neither behavioral nor cognitive theories of motivation

have effectively addressed the role of interest and affective experiences in motivation,

and the lack of guiding theories may partly explain the relatively small amount of

research in this field.

Conceptual Issues

Research on interest and affect in motivation has suffered from unclear,

conflicting conceptualizations. In terms of interest, motivational psychologists,

developmental psychologists, and educational psychologists have provided at least

three perspectives on interest: personal interest as an individual disposition,

interestingness as an aspect of the context, and interest as a psychological state (Krapp,

Hidi, & Renninger, 1992; Schunk et al., 2008). Among them, personal interest

reflects a relatively stable disposition; interestingness primarily concerns the context

(e.g., interestingness of an activity or task); and situational interest is viewed as a

psychological state. The three perspectives are not mutually exclusive, given that

situational interest may be triggered and maintained by interestingness as a contextual

factor (Hidi, 2000; Schunk et al., 2008).

Both personal interest and situational interest refer to a psychological state of

being interested, but they vary in levels of relation to stored knowledge and value

(Renninger & Hidi, 2002). In particular, situational interest may require little prior

knowledge and is not necessarily associated with positive value, whereas personal

interest indicates that the individual has both stored knowledge and positive value for

the object of interest. Renninger and Hidi (2002) further contend that it is optimal for

situational interest to evolve into personal interest through the student’s interaction

with the environment because personal interest is the more fully developed form of interest.

This personal interest then helps maintain and deepen the student’s motivation in spite

of adverse situations. To summarize the three perspectives on interest in this view,

situational interest is triggered by interestingness of the academic task, and may

further evolve into personal interest depending on opportunities and support available

to the student (Hidi, 1990; Krapp, 1999).


Such evolution of situational interest into personal interest is certainly of great

importance to educational researchers. Research on this process may provide valuable

implications for effective learning and instruction. However, challenges also emerge

with respect to the conceptualization and measurement of these closely related

constructs. First and foremost, it seems difficult to distinguish interest from affect

when interest is viewed as being “aroused or activated as a function of interestingness

of the context” (Schunk et al., 2008, p. 213), especially in view of the broad

definition of affect, which encompasses nearly all subjectively experienced feelings

including arousal, emotions, and moods (Blechman, 1990). Renninger’s model (e.g.,

Renninger, 1990, 1992), which attempted to account for interest by learners’ prior

knowledge and subjective value of the task, has received criticism that, as opposed to

her hypothesis, it seems intuitively possible for individuals with low prior knowledge

to be highly interested in particular activities. Thus, this model may capture personal

interest fairly well, but seems incapable of accounting for the formation of situational

interest. Moreover, very few quantitative measures are available for distinguishing the

two types of interest, let alone examining how situational interest may develop into

personal interest.

Although emotions are viewed as affect without much skepticism, there are

also a few conceptual issues with emotions. Emotions are often confused with other

affective phenomena such as moods and sentiments. Frijda (1994) suggested two

major distinctions between emotion and other affective phenomena: whether this

affective state is object-directed, and whether it can be viewed as an enduring

disposition. Emotions are object-directed and fairly short-lived as compared with

moods or sentiments. More importantly, emotions are normally unrelated to enduring

dispositions. A student feeling anxious about an upcoming test is not necessarily

prone to anxiety by nature. Sentiments, however, are used to describe dispositions

that people possess “to respond affectively to particular objects or kinds of event”

(Frijda, 1994, p. 64). When a student says he dislikes mathematics, such “dislike”

does not represent a short-lived, momentary response to mathematics; rather, it reflects



an affective disposition that is relatively stable. Most sentiments are acquired based

on previous experience or social learning (Frijda, 1994). For instance, a student may

have failed mathematics tests so many times that he has developed a negative

sentiment about mathematics out of negative feelings accompanying each failure.

The conceptual issues may complicate the investigations of emotions in

academic settings when researchers have different working definitions for emotion.

Some (e.g., Forgas, 2000) may define emotions (e.g., test anxiety) as short-lived,

object-directed phenomena, and others’ conceptualizations may be closer to moods or

sentiments as presented by Frijda (1994). The bulk of educational research on

emotions derived from Pekrun and colleagues based on Pekrun’s (1992) taxonomy of

academic emotions. Some emotions (e.g., retrospective emotions) in this taxonomy

seem to be more general and enduring than others (e.g., process-related emotions),

which cautions researchers about the need for accurate measures. Hence, both the

theoretical and conceptual issues discussed above have led to specific measurement

issues in this field of research.

Measurement Issues

Accurate measurement is the cornerstone of empirical research. The

overarching validity of a study is built on the reliability and validity of its

measurement(s). All types of measurements have their strengths and weaknesses, but

measurement issues particularly persist in the use of self-report instruments to elicit

responses of one’s affective state. Self-report measurements can be defined as

instruments used for collecting numerical scores from which inferences can be made

(Gall, Gall, & Borg, 2006). Self-report methods, although the most efficient and

easiest way for latent constructs to be inferred, largely rely on the assumption that

respondents are able to observe and report on their own affective states. In other

words, self-report measures require certain levels of metacognition and self-awareness

(Schunk et al., 2008), which may not be sufficiently high in children attending

elementary schools (Flavell, 1987).



The test validity and reliability of measurements of interest have been plagued

with the theoretical and conceptual issues discussed earlier. Test validity is defined as

the “degree to which evidence and theory support the interpretation of test scores

entailed by proposed uses of tests” (American Educational Research Association,

American Psychological Association, & National Council on Measurement in

Education, 1999, p. 9). In other words, an instrument is considered to be valid when it

measures the construct(s) it is intended to. Because researchers have yet to reach an

agreement regarding the psychological nature of interest (i.e., cognitive vs. affective),

it is difficult to ascertain the general validity of an instrument of interest.

Test reliability is sometimes conceptualized as the degree to which an

instrument measures whatever it is measuring (Gay, Mills, & Airasian, 2008), but it

essentially refers to the degree to which observed scores reflect the true score of the

construct being measured (Gall et al., 2006; Larsen & Fredrickson, 1999). Although

educational researchers emphasize the role of test reliability as the foundation for all

following inferential work in empirical studies, it is not uncommon for researchers of

emotions to rely on single-item tests for measuring emotions. These researchers

contended that emotion “is more typically construed as within-subject construct (a

state), and we assume that it may change quickly and frequently within any single

individual” (Larsen & Fredrickson, 1999, p. 43). This may lead to the uncertainty in

using classical test theory (CTT) methods (e.g., reliability coefficient) to assess the

quality of a measurement when it is possible to bypass reliability concerns and focus

on the validity instead.
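As a minimal illustration of the CTT reliability coefficient mentioned above, the following Python sketch computes Cronbach's alpha, one common internal consistency coefficient. The choice of alpha, the function name cronbach_alpha, and the response data are assumptions made for illustration only; none of them is drawn from the SDQI data. Because the coefficient is defined only for two or more items, the sketch also suggests why a single-item emotion measure cannot be evaluated in this way.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    if k < 2:
        # Internal consistency cannot be estimated from one item.
        raise ValueError("alpha is undefined for a single-item test")
    # Variance of each item across respondents.
    item_vars = [pvariance([resp[j] for resp in item_scores]) for j in range(k)]
    # Variance of the summed (total) score across respondents.
    total_var = pvariance([sum(resp) for resp in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of five children to three 4-point items.
responses = [[4, 3, 4], [2, 2, 1], [3, 3, 3], [1, 2, 1], [4, 4, 3]]
alpha = cronbach_alpha(responses)
```

Calling cronbach_alpha on a one-item test raises an error in this sketch, which mirrors the situation in which researchers set reliability aside and concentrate on validity.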

This complicates the goal of including more women and ethnic minorities in

advanced STEM courses because some measures may not be fair. A test is not fair

when two groups of equal levels of a latent trait (e.g., interest) earn different scores on

the same item of the test (Gall et al., 2006). This measurement bias is also termed as

differential item functioning (DIF; Teresi, 2006). With regard to measures of interest,

Schunk et al. (2008) suggested that it might be particularly difficult for some

subgroups to accurately report their affects. Before drawing any conclusions



regarding gender or ethnic differences in attitude, interest, or affect, researchers need

to be aware of these possible measurement biases. Although research has indicated

gender differences in mathematics interest (e.g., Evans, Schweingruber, & Stevenson,

2002; Köller, Baumert, & Schnabel, 2001; Marsh, Trautwein, Lüdtke, Köller, &

Baumert, 2005; Renninger, 1992), it is possible that such differences primarily came

from item- and scale-level measurement biases rather than real differences in latent

traits. Moreover, cultural differences have been a frequently studied source of

response bias, as the language of the instrument may be a potential threat to

measurement invariance, particularly for multilingual respondents (Church, 2001;

Ramirez, Teresi, Holmes, Gurland, & Lantigua, 2006).
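The definition of DIF given above can be sketched numerically. The Python example below assumes a dichotomous two-parameter logistic (2PL) item, which is a simplification, since the SDQI items are polytomous. The function name p_endorse and all parameter values are hypothetical. It shows two respondents with the same latent interest who nonetheless have different probabilities of endorsing the same item because the item threshold differs by group.

```python
import math


def p_endorse(theta, a, b):
    """2PL probability of endorsing an item (a = discrimination, b = threshold)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


theta = 0.0  # identical latent interest for both respondents (hypothetical)

# Same discrimination, but the item has a higher threshold for the focal group.
p_ref = p_endorse(theta, a=1.5, b=0.0)    # reference group
p_focal = p_endorse(theta, a=1.5, b=0.5)  # focal group

# A nonzero gap at equal theta is what is meant by (uniform) DIF.
dif_gap = p_ref - p_focal
```

In an actual DIF analysis the group-specific parameters would be estimated from data and the difference tested statistically; the fixed values here only illustrate the concept.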

All measurement issues discussed above call for the utilization of the item

response theory (IRT) in research on mathematics interest and affect. IRT bases its

analysis on some unique assumptions, which allow it to overcome many shortcomings

of the CTT methods (Gall et al., 2006). It is particularly helpful in examining

instruments for which reliability concerns have to be bypassed. Further, unlike the

CTT perspective, which assumes the same amount of error for each respondent, IRT

analyses can examine the unique item-respondent interaction, so that an item

may be identified as being too easy (very low thresholds) or too difficult (very high

thresholds) for a particular respondent. Using IRT, researchers evaluate not only

the psychometric quality of an instrument at the scale and subscale levels, but also its

item-level characteristics.
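The contrast with the single error term of CTT can be sketched as follows. Assuming, for illustration only, a dichotomous 2PL model with hypothetical item parameters, the information an item provides at trait level theta is a squared times P times (1 minus P), and the standard error of the trait estimate is the reciprocal of the square root of the summed information. The names item_information and standard_error are introduced for this sketch and are not taken from any software used in this study.

```python
import math


def p_endorse(theta, a, b):
    """2PL probability of endorsing an item (a = discrimination, b = threshold)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of one 2PL item at trait level theta."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)


def standard_error(theta, items):
    """Conditional standard error: 1 / sqrt(sum of item information)."""
    total_info = sum(item_information(theta, a, b) for a, b in items)
    return 1.0 / math.sqrt(total_info)


# Hypothetical (discrimination, threshold) pairs for a three-item scale.
items = [(1.2, -0.5), (1.0, 0.0), (1.5, 0.5)]

se_mid = standard_error(0.0, items)   # respondent near the item thresholds
se_high = standard_error(3.0, items)  # respondent for whom all items are "too easy"
```

Under these assumed values the standard error is larger for the respondent far above the thresholds, which is the sense in which measurement error in IRT depends on the match between items and respondent.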

Purpose of the Study

The purpose of this dissertation was to evaluate the psychometric properties of

the measure of students’ mathematics interest. The study comprises three phases in

which Self Description Questionnaire I (SDQI; Marsh, 1992a) data from the Early

Childhood Longitudinal Study-Kindergarten (ECLS-K) were utilized. In the first phase,

the measure was evaluated from the CTT and factor analytic perspectives for

properties to demonstrate factor structure and internal consistency reliabilities. The

association between mathematics interest and mathematics performance was examined



to provide evidence of the external validity. In the second phase, the item-level

characteristics were examined within the IRT framework (discussed in Chapter 3). In

the third phase, measurement biases (DIF) across gender, ethnicity, and age were

examined within the IRT framework.

Research Questions

To achieve the purpose of this dissertation, the following research questions

were outlined:

1. From both the CTT and IRT frameworks, does the SDQI accurately measure

children’s interest and affect in mathematics?

2. Do items of mathematics interest and affect demonstrate measurement bias

across gender?


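3. Do items of mathematics interest and affect demonstrate measurement bias

across ethnic groups?

4. Do items of mathematics interest and affect demonstrate measurement bias

across age groups?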

5. How do children’s responses to the mathematics interest and affect items

change over time?

Significance of the Study

In the literature, two explanations are given for unsatisfactory academic

performance: lack of ability and lack of effort or motivation. Because there is little

that researchers and educators can do about students’ abilities, much effort has been

invested in boosting their academic motivation (Hidi & Harackiewicz, 2000).

However, the unique features of the STEM subjects appear to daunt students from an

early age. As reported by the President’s Council of Advisors on Science and

Technology (2010), many U.S. students and their parents believe that STEM subjects

are “inherently boring, cryptic, or beyond their grasp” (p. 57). As such, it is even

more critical to identify and maintain students’ interest in STEM subjects. Along with



the fact that U.S. students’ STEM achievement declines with age (Fleischman et al.,

2010; Provasnik et al., 2012; Schmidt, 2011), research has shown that children’s

interests and attitudes toward specific subject areas such as mathematics tend to

deteriorate as they get older (e.g., Eccles, Wigfield, & Schiefele, 1998). Moreover, the

wide disparities of STEM performance across gender and ethnic groups (NRC, 2011)

also entail in-depth examinations of the measures that researchers use for gauging

interest, attitude, and affect in STEM education.

This dissertation provides a comprehensive examination of a commonly used

measure of mathematics interest and affect—the Self Description Questionnaire I

(SDQI). The SDQI was designed to measure multiple dimensions of self-concept for

pre-adolescents in primarily four non-academic areas (i.e., physical ability, physical

appearance, peer relations, parent relations), and three academic areas (i.e., reading,

mathematics, and school in general; Marsh, 1992a). From a review of extant literature,

the SDQ has been cited in over 450 studies from 2003 to 2013. Thus, a psychometric

examination of this measure with respect to mathematics interest and affect was

particularly warranted. As discussed earlier, the conceptual issues particularly relate

to the face validity of individual items designed to tap interest and affect in academic

settings. For example, with respect to the distinction between personal interest and

situational interest (Schunk et al., 2008), the SDQI item “I enjoy doing work in math”

seems to be more concerned about situational interest, given that it specifies a context

for this interest to be triggered and observed, while a less straightforward item “I

cannot wait to do math each day” seems to be more of a measure of personal interest

as a dispositional trait. We may therefore expect to see varying levels of item

discrimination and threshold. Thus, the IRT analyses of these item-level

characteristics are promising in revealing such operationalization issues that may

interfere with the measurement. Moreover, findings of this study would also

contribute to the knowledge base for future construction and validation of instruments

of academic interest.



In addition to examining the psychometric properties of the instrument, the

current study examines relevant differences in mathematics interest and affect

according to group membership (e.g., gender and ethnic group), and across time.

Although gender, ethnic, and age differences are well-documented in the literature

(e.g., Else-Quest et al., 2010), few studies have been conducted for examining the

measurement invariance of these measures. Such examinations of pertinent

instruments are considered critical because they would likely confirm or disconfirm

previous findings regarding gender, ethnic, and age differences as assessed by the

SDQI and similar measures. More importantly, findings with respect to measurement

bias in SDQI were expected to provide practical implications as to how strategies (e.g.,

replacement of items) might be implemented for fairer measures across demographic

groups (Teresi, 2006).

Limitations of the Study

The current study is limited in that it focuses on just one measurement of

mathematics interest and affect, the SDQI, and there may be other measures of

mathematics interest that would warrant investigation as well. However, IRT analyses

require large sample sizes (e.g., a sample size of 500 or more is preferable; see Reise

& Yu, 1990; Herrera & Gomez, 2008). Thus, the analysis of other measures was

precluded given the availability of appropriate data to conduct these analyses. To

offset this limitation, the external validity of the measure was examined with respect

to actual achievement to ensure that the measure might be considered authentically

related to achievement.

There are also concerns regarding the face validity and factor structure of

relevant SDQI items within the context of the current study. The SDQI was

hypothesized to measure multiple dimensions of preadolescent children’s

self-concept (Marsh, 1992a), in which student ratings of their skills, ability, enjoyment, and

interest in mathematics altogether represent one out of the seven unique dimensions

(i.e., physical abilities, physical appearance, reading, mathematics, peer relations,

parent relations, general-self, and general-school). Although items measuring



mathematics interest are distinctive in terms of how they are phrased, they were not

distinguishable from items of perceived competence of mathematics as a result of

factor analysis (Marsh & O’Neill, 1984). A possible explanation may be the positive

association between perceived competence and intrinsic pleasure in Harter’s (1981,

1996) model of mastery motivation in children, but it also points out that interest is

somewhat elusive to capture with the presence of related constructs.

Additionally, in examining differences in measurement across time, the

mathematics interest and affect were only measured across two time points (i.e., third-

and fifth-grade). Even though the two time points represent a critical stage in

children’s development of interest, attitude, and affect (Eccles et al., 1998; Wigfield,

Eccles, Schiefele, Roeser, & Davis-Kean, 2006), the data only allowed for the

assessment of the amount of change, but not the pattern of this change. Though

adopting a longitudinal perspective, this study is not strictly longitudinal in nature,

given that any two-wave observation is incapable of providing estimates of the

parameters of the growth curve (Rogosa, 1995).
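The two-wave limitation can be made concrete with a short Python sketch. The scores and the helper names line_through and residual_df are hypothetical. With only two observations per child, a straight line always fits exactly, so the data yield an amount of change but no information with which to compare a linear pattern against any alternative.

```python
def line_through(t1, y1, t2, y2):
    """Intercept and slope of the unique line through two (time, score) points."""
    slope = (y2 - y1) / (t2 - t1)
    intercept = y1 - slope * t1
    return intercept, slope


def residual_df(n_waves, n_params):
    """Observations left over for judging fit after estimating n_params."""
    return n_waves - n_params


# Hypothetical interest scores for one child at Grade 3 and Grade 5.
intercept, slope = line_through(3, 3.2, 5, 2.8)
change = 2.8 - 3.2  # the amount of change is recoverable

df_linear = residual_df(n_waves=2, n_params=2)     # nothing left to test linearity
df_quadratic = residual_df(n_waves=2, n_params=3)  # negative: curve not identified
```

A third wave would leave one degree of freedom for a linear trajectory, which is the minimum needed to begin assessing the shape of change.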



CHAPTER II

REVIEW OF LITERATURE

The review of literature is focused on the theoretical perspectives and research

findings on interest with respect to theories of achievement motivation because the

significance of this dissertation lies in the consistent findings that interest is positively

correlated with motivation and performance in academic settings (Schunk et al., 2008).

It begins with a general review of the concept of interest, discusses the role of interest

and related affective experiences in major motivational theories, and introduces the

theoretical basis for investigating gender, ethnic, and developmental differences in

mathematics interest.

The Concept of Interest

The topic of interest has a long history that can be traced back to the 1800s,

when Herbart (1806) contended that interest could promote motivation in learning.

The beginning of the 20th century witnessed a surge of scholarly discussions of

interest among theorists. Dewey (1913) believed that an individual interacts with the

environment to raise interest; Thorndike (1935) considered learning to be affected by

the learner’s interest; and Bartlett (1932) added that interest might specifically

facilitate the learner’s memory. Following this surge, however, research on interest

declined when behaviorism became dominant in psychology (Schunk et al., 2008).

Krapp et al. (1992) provided two reasons for this decline: the many and varied

conceptualizations of interest and the development of more discrete research

approaches, which had rendered the concept of interest superfluous. They further

argued that each of these approaches typically focused on a single aspect of interest

such as attention, curiosity, emotion, attitude, and motivation. The situation of

research on interest did not change much when cognitive psychology emerged to

dominate psychological research because the early cognitive theories largely neglected

the motivational processes. It was not until the late 1980s and early 1990s that we saw

a renaissance of the concept of interest as researchers began to integrate motivational

and cognitive variables to better explain learning and performance (Krapp et al., 1992;



Schunk et al., 2008). That being said, these research endeavors (e.g., Hidi, Baird, &

Hildyard, 1982; Renninger, 1984, 1989, 1990; Schiefele, Winteler, & Krapp, 1988)

were still based on varied conceptualizations of interest, thus representing different

perspectives and questions about learning. This may explain why research on interest

again has waned in the past two decades.

It appears that the difficulty in such conceptualization lies in the fact that

interest is itself a concept invented and used in everyday life to refer to a certain

psychological phenomenon. When a psychologist wants to investigate a person’s

interest, he will have to assume that the person uses the term “interest” in its more

general shared meaning. It is through this assumption that we created the

psychological construct “interest,” but Valsiner (1992) cautioned that “such

operationalization remains a methodological construction that leads to data derivation

from the phenomena in ways that by their nature eliminate a relevant aspect of the

phenomena from the data” (p. 29). Such confusion is evident in the use of any

common-language meaning as a psychological construct, resulting in a few issues

pertinent to the conceptualization as well as the operationalization of interest. First,

interest is so commonly used in everyday language that researchers may find it

unnecessary or even impossible to assign a specific definition to it. As it turned out,

however, their operationalization exhibited very different patterns and perspectives of

what interest constitutes. Second, disparities exist between interest in its common-

language meaning and interest as a psychological construct. Thus, any self-report

measures of interest may demonstrate a unique validity issue because the assumption

that a respondent interprets the term “interest” in its more general meaning appears

fairly weak.

Personal Interest and Situational Interest

Despite the varied conceptualizations, most theorists agree that interest is a

phenomenon that emerges from an individual’s interaction with the environment

(Krapp et al., 1992). With respect to the person-environment interaction, there have

been two related bodies of research on interest. The first body of research



concentrates on personal interest, also referred to as individual interest or trait interest,

by analyzing its origins and effects, particularly the effect on learners’ cognitive

performance. The second body of research concentrates on the specific characteristics

of the environment that capture the interest of many learners regardless of their

personal interests. Such situational interest is termed as state interest in other

literature. In summarizing the two bodies of research, Krapp et al. (1992) stated that

“interest as a psychological state, and situation-specific factors that bring about

interest, then, reflect two distinct research approaches for investigating the role of

interest in learning and development” (p. 6). However, this statement appears to be

inaccurate in light of the development of interest research in the past two decades.

First of all, some theorists contended that there are three, rather than two,

perspectives of interest: personal interest, situational interest, and interestingness of

the context (Schunk et al., 2008). Krapp et al. (1992) did not distinguish

interestingness of the task from situational interest, which made the investigations of

situational interest less person-oriented. More recent literature (e.g., Renninger &

Hidi, 2002) suggested that both personal interest and situational interest refer to a

psychological state of being interested, and situational interest may translate into

personal interest given certain conditions. Hence, we may draw a distinction between

investigations of person-oriented interest (i.e., personal interest and situational interest)

and investigations of interestingness as a contextual factor, but personal interest and

situational interest actually represent two research areas that are closely related.

Personal interest is considered to be specific to individuals, relatively stable,

and usually associated with positive emotions and values (Krapp et al., 1992). In

modern psychology, various theoretical perspectives are available for investigating

personal interest. In social psychology, for example, interest has been viewed as a

vocationally relevant disposition closely related to the concept of attitude, and

sometimes even defined as attitude (Evans, 1971). On the other hand, process-

oriented theories are more concerned about students’ interest as demonstrated by the



levels of psychological arousal such as effortless attention accompanied by positive

affect (Krapp et al., 1992).

In contrast to personal interest, situational interest refers to interest that is

generated primarily by certain conditions or environmental stimuli. From this

perspective, situational interest is not specific to the individual, but tends to be

common across individuals (Krapp et al., 1992). Because there is large variability in

the person-environment interactions in raising and maintaining interest, however, it is

not likely that individuals would experience the same level of arousal given the same

external stimulus. Because of this, situational interest is unstable and fairly

short-lived as compared with personal interest. To further distinguish situational

interest from personal interest, theorists also provided a perspective regarding the

affective component of interest. While personal interest is usually accompanied by

positive feelings, situational interest may not be as consistently associated with such

positive affect (Iran-Nejad, 1987).

Interest, Affect, and Motivation

Figure 2.1 illustrates the relations among the three approaches to interest

research. This structure clearly coincides with social cognitive theory (discussed

in detail in “Interest and Affect in Motivation”) in that it views interest as a personal

factor (i.e., personal interest as a disposition), an environmental factor (i.e.,

interestingness), and a behavioral factor (i.e., psychological arousal through person-

environment interaction). As such, it is appropriate to adopt the social cognitive

perspective in relevant investigations.

Previous studies in different psychological domains (e.g., social psychology,

applied psychology) have examined interest as, or as related to, constructs including

attention, curiosity, emotion, attitude, and motivation. With its common meaning in

everyday language, the term “interest” may inevitably give rise to its many and varied

conceptualizations. It is therefore critical for a particular interest study to be specific

about its guiding theoretical perspective. Given the importance of motivation in



academic settings, particularly with regard to the underrepresented populations in

STEM fields, this dissertation undertakes the motivational perspective to view interest

and relevant affective experiences as contributing factors of achievement motivation.

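[Figure 2.1 labels: Characteristics of the person; Psychological state within the person; Characteristics of the learning environment (material/text). Boxes: Personal interest as a disposition; Interestingness; Actualized individual personal interest; Situational interest.]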

Figure 2.1. Three approaches to interest research. Adapted from The Role of Interest

in Learning and Development (p. 10), by K. A. Renninger, S. Hidi, and A. Krapp

(Eds.), 1992, Hillsdale, NJ: Erlbaum. Copyright 1992 by Lawrence Erlbaum, Inc.

To further complicate the situation of interest research, it is worth noting that

the literature did not provide a clear distinction between interest and affect. According

to the broad definition of affect which involves all subjectively experienced feelings

including arousal, emotions, and moods (Blechman, 1990), it is reasonable to argue

that interest, particularly situational interest, is a certain type of affective experience.

Although a single aspect of interest may resemble other constructs such as attitude

(e.g., Evans, 1971), attention (e.g., Eysenck, 1982), or emotion (e.g., Izard, 1977), the

status of interest as any of these affective constructs is debated. What we are more

certain about is that interest, as a psychological state, tends to be accompanied with a

variety of taxonomies of affect such as physiological changes (Silvia, 2008), cognitive

appraisal (Silvia, 2005) and motivational value (Izard & Ackerman, 2000). The

affective component, however, appears to be more intense and consistent in personal

interest than in situational interest (Krapp et al., 1992). Given the complicated

relations between affective experiences and the concept of interest, affect was

included as one of the peripheral factors of interest under the investigation of this

dissertation.

Interest and Affect in Theories of Achievement Motivation

Motivation is defined as “the process whereby goal-directed activity is

instigated and sustained” (Schunk et al., 2008, p. 4). In academic settings, motivation

can influence what, when, and how we learn (Schunk, 1995). Motivation is a

multifaceted construct, and researchers usually gauge it through its behavioral

indicators. Researchers in general agree on the following four behavioral indicators of

motivation: choice of task, effort, persistence, and performance (Schunk et al., 2008).

In fact, the first three may be viewed as the mechanism through which motivation

affects a learner’s learning outcomes or performance. For example, a motivated

learner is more likely to choose challenging tasks (e.g., taking advanced math courses

which are optional), invest more time and effort on the tasks (e.g., studying advanced

mathematics for extensive time, actively seeking help), and persist in working on the

tasks when encountering difficulties (e.g., not quitting the course when failing an

exam) than an unmotivated peer. These behaviors will likely lead to better outcomes

or performance. Because of this, motivation plays a vital role in learning and has

received considerable attention in educational research.

The linkage between achievement motivation and positive learning outcomes

may be explained from either a behavioral or cognitive perspective of learning.

Behaviorism (e.g., Skinner, 1953) views learning as a change of frequency in

behaviors through the A-B-C (Antecedent-Behavior-Consequence) operant

conditioning process, and the aforementioned behaviors (e.g., choosing challenging tasks,

investing effort) would help a learner obtain desirable consequences, which further

reinforce his learning. Thereby, motivated individuals enhance their learning

experiences by strengthening the stimulus-response (S-R) connection. On the other

hand, cognitivists, particularly constructivists (e.g., Piaget, 1985), contend that learning

occurs when a learner actively “constructs” knowledge, and such construction is

activated through an adaptation process—assimilation or accommodation (Piaget,

1985). As such, a challenging task is more likely to trigger the adaptation process and

facilitate the construction of knowledge. Finally, social cognitive theorists (e.g.,

Bandura, 1977; Vygotsky, 1978) stress the importance of the interaction between

individuals and their social context. The choice of challenging tasks and help-seeking

behaviors of a motivated learner indicate the active role of learners in organizing

personal and environmental resources in order to produce positive outcomes.

In mentioning the active role of learners, the contribution of social cognitive

theory must be acknowledged. Social cognitive theory (Bandura, 1977, 1986, 1997)

has had an enormous influence on psychology and education over the past three

decades (Zimmerman & Schunk, 2003). In general, Bandura (1977) analyzed human

learning and self-regulation in terms of triadic reciprocal causations, which involve

three types of determinants (i.e., personal, behavioral, and environmental) of learning

(see Figure 2.2). Social cognitive theory “depicted people as self-organizing,

proactive, self-reflective, and self-regulative rather than merely reactive to social

environmental or inner forces” (Zimmerman & Schunk, 2003, p. 439). In comparison,

while behavioral theories of motivation focus on the lower half of the triadic

reciprocality model, that is, the connection between environmental (i.e., stimuli) and

behavioral determinants (i.e., response), the social cognitive perspective of motivation

directs researchers’ attention to the (within) personal determinants including thoughts

and feelings. As Bandura (1986) stated, “what people think, believe, and feel affects

how they behave. The natural and extrinsic effects of their actions, in turn, partly

determine their thought patterns and affective reactions” (p. 25).

The social cognitive perspective of motivation allows researchers to uncover

the cognitive and affective factors, which may explain the individual variance

unaccounted for by merely looking at environmental factors such as classroom setting

and instructional methods. As an example, two classmates may be receiving exactly

the same instruction in the same classroom environment, yet demonstrate varying



levels of motivation, because they may differ greatly in how they perceive the

environment, make attributions of their successes and failures, and affectively respond

to certain events. Bandura’s (1986) statement that “what people think, believe, and

feel affects how they behave” (p. 25) concerns at least two domains of personal

determinants in the triadic reciprocality model: the cognitive domain and the affective

domain. A review of the major theories of motivation to date, however, reveals an

imbalance that favors the cognitive domain over the affective domain.

Figure 2.2. Model of triadic reciprocal causations. From Social Foundations of

Thought and Action: A Social Cognitive Theory (p. 24), by A. Bandura, 1986,

Englewood Cliffs, NJ: Prentice Hall. Copyright 1986 by Prentice-Hall Inc.

Expectancy-Value Theory of Motivation

The expectancy-value theory of motivation has been one of the most prominent

views on the nature of achievement motivation. Its application in educational research

mainly draws from the work of Wigfield, Eccles, and their colleagues (e.g., Eccles et

al., 1983; Wigfield, 1994; Wigfield & Eccles, 2000, 2002; Wigfield et al., 2006).

From a social cognitive perspective, motivation is affected by a variety of factors such

as reinforcement of behavior, learners’ goals, interests, and sense of self-efficacy and

self-determination. In the social cognitive expectancy-value model (see Figure 2.3),

these factors are organized to create two general sources of motivation: learners’

expectation of how well they will do on upcoming tasks (expectancy), and the values

(subjective value) they place on these tasks. The relationship between expectancy and

task value is multiplicative, often written in formula as expectancy × value =

motivation (Wigfield & Eccles, 2002). In other words, if a learner scores 0 on either

of the two factors, he will not be motivated at all on the task, no matter how high he

scores on the other factor. According to Wigfield and Eccles (2000), expectancies and

values are assumed to be directly related to achievement motivation on learning tasks,

and a variety of variables are assumed to influence expectancies and values.
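The multiplicative relation described above can be written out as a short worked equation. This is only an illustrative sketch: the symbols M, E, and V and the 0-to-1 scaling are assumptions introduced here for illustration, not notation taken from Wigfield and Eccles (2002).

```latex
% Illustrative notation (assumed, not from the cited sources):
% M = motivation, E = expectancy, V = subjective task value
\begin{align*}
M &= E \times V \\
E = 0,\ V = 0.9 &\;\Rightarrow\; M = 0 \times 0.9 = 0 \\
E = 0.5,\ V = 0.8 &\;\Rightarrow\; M = 0.5 \times 0.8 = 0.4
\end{align*}
```

Under a hypothetical additive rule (M = E + V), by contrast, the first learner would still register some motivation (0.9); it is the multiplicative form that makes a zero on either factor decisive.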

Figure 2.3. A social cognitive expectancy-value model of achievement motivation.

From Motivation in Education (p. 51), by D. H. Schunk, P. R. Pintrich, and J. L.

Meece, 2008, Columbus, OH: Merrill. Copyright 2008 by Merrill.

As shown in Figure 2.3, affective memories as a function of social world

factors (e.g., cultural milieu) and cognitive processes (e.g., perceptions of social

environment) are postulated to have immediate influence on task value. Although it

seems intuitive that an individual’s affective experiences with respect to a particular


activity or task may impact the values he assigns to the task, affective memories is a

less empirically explored component within the expectancy-value framework (Schunk

et al., 2008). The linkage between affective memories and task value, therefore,

remains largely unknown in the literature.

Interest is among the four components of subjective value in the expectancy-value

model: attainment value, intrinsic value (interest), extrinsic/utility value, and cost.

Intrinsic interest is defined as the enjoyment that people experience when performing

a task or their subjective interest in the content of a task (Wigfield & Eccles, 1992).

The investigations of intrinsic interest for predicting motivation and performance have

generally yielded positive results, but interest is mostly analyzed as one subscale along

with attainment and extrinsic values for determining the task value of participants

(Schunk et al., 2008). Locke and Latham (1990) suggested that this conceptualization

of task values represented a rational decision-making process of individuals, which

implies that investigations of interest remain largely cognitive.

Attribution Theory of Motivation

Weiner’s (1986, 1992) general attributional model of motivation includes

psychological consequences as a critical component of the attributional process. As

shown in Figure 2.4, the perceived antecedent conditions include both environmental

factors (e.g., social norms, situational features) and personal factors (e.g., causal

schema, attributional bias), based upon which learners generate their attributions for

factors such as ability, effort, and luck. This process is termed the attribution process

(Kelley & Michela, 1980). For example, the gender stereotype that girls lack

mathematical ability (social norms) may lead a girl to attribute her poor performance

in mathematics to her perceived lack of ability, rather than effort. Then, in the

attributional process (Kelley & Michela, 1980), learners’ causal attributions are

expected to produce a series of psychological consequences, including expectancy for

success, self-efficacy, and affect.


Antecedent Conditions -> Perceived Causes -> Causal Dimensions -> Psychological Consequences -> Behavioral Consequences

Antecedent Conditions: Environmental factors (specific information, social norms, situational features); Personal factors (causal schemas, attributional bias, prior knowledge, individual differences)

Perceived Causes: Attributions for ability, effort, luck, task difficulty, teacher, mood, health, fatigue, etc.

Causal Dimensions: Stability, locus, control

Psychological Consequences: Expectancy for success, self-efficacy, affect

Behavioral Consequences: Choice, persistence, level of effort, achievement

Attribution Process: from antecedent conditions to perceived causes. Attributional Process: from perceived causes to psychological and behavioral consequences.

Figure 2.4. Overview of the general attributional model. From Motivation in

Education (p. 82), by D. H. Schunk, P. R. Pintrich, and J. L. Meece, 2008, Columbus,

OH: Merrill. Copyright 2008 by Merrill.

Like the expectancy-value model, the attributional model focuses on

expectancy and ability belief factors for predicting motivation and achievement,

although the former distinguishes ability belief (self-efficacy) from expectancy. In

regard to affect, the attributional model indicates an immediate linkage between

affective outcomes and behavioral indicators of motivation. Typical affective

outcomes include emotions such as pride, shame, guilt, anger, and sympathy, and

Weiner (1994) postulated that these emotions would likely contribute to behavioral

consequences of motivation such as choice of behavior and level of effort. However,

a key criticism of Weiner’s (1994) linkage of emotions to motivation is that attribution

theory is cognitive in nature. Rather than dealing with actual emotions, attribution

theory focuses on the memory of emotions and the accompanying cognitions (Schunk

et al., 2008). As such, emotions within this framework are conceptually close to


affective memories in the expectancy-value model. In addition, the attributional

model does not seem to encompass a dimension of interest or intrinsic enjoyment of

doing a task.

Goal-Orientation Theories

Goal-orientation theories of motivation are focused on why individuals set up

specific goals for themselves and how they approach the tasks (Schunk et al., 2008).

While theorists have proposed different classifications of goal-orientations, the most

commonly cited and researched one has been the classification of mastery goal-orientation

versus performance goal-orientation (Dweck & Leggett, 1988). Whereas

mastery/learning goal orientations are concerned with the mastery of knowledge and

skills according to self-set standards, performance goal orientations are more

concerned with how one’s performance is judged in relation to others. Theorists

further added an approach versus avoidance classification. Learners are thereby

classified into four types in the taxonomy: performance approach, performance

avoidance, mastery approach, and mastery avoidance. Dweck and her colleagues (e.g.,

Dweck, 1999; Dweck & Elliott, 1983; Dweck & Leggett, 1988) also suggested that

goal-orientations may be a function of implicit theory of intelligence/ability: people

with entity beliefs tend to think their intelligence or ability in a specific domain is

“fixed” and cannot be improved, whereas those with incremental beliefs tend to think these qualities

are “malleable” and may be enhanced. An attributional process (Weiner, 1986) serves

as the underlying mechanism in this context. An individual may attribute his past

successes or failures according to dimensions of controllability (controllable vs.

uncontrollable), locus (internal vs. external), and stability (stable vs. unstable). When

a learner attributes his past experiences to uncontrollable and stable reasons (e.g., “I’m

not born as a math person, and it won’t change no matter what”), he clearly holds

entity beliefs and will likely adopt a performance goal because deep, meaningful

learning will not improve his intelligence or ability. In contrast, a learner holding

incremental beliefs tends to see deep learning as a way to improve himself and will


more likely adopt a mastery approach goal because he believes that his ability is under

his own control.

Given the emphasis on cognitive processes in Weiner’s (1986) attribution

theory, it seems difficult to identify a role of interest in goal-orientation theories.

However, there have been a series of studies (e.g., DeCuir-Gunby, Aultman, & Schutz,

2009; Kaplan & Midgley, 1999; Pekrun, Elliot, & Maier, 2006, 2009; Pekrun, Goetz,

Titz, & Perry, 2002; Seifert, 1995) on the affective antecedents as well as outcomes of

different goal-orientations. In general, these studies concluded that positive emotions

are directly associated with approach goal-orientations, whether mastery-approach or

performance-approach, whereas negative emotions are directly associated with avoidance

goals.

Intrinsic and Extrinsic Motivation

The theoretical perspective of intrinsic motivation versus extrinsic motivation

also has a relation to the construct of interest. Intrinsic motivation “refers to

motivation to engage in an activity for its own sake,” and extrinsic motivation refers to

“motivation to engage in an activity as a means to an end” (Schunk et al., 2008, p.

236). Although intrinsic motivation and interest appear to be similar because they

both involve the enjoyment of engaging in an activity or task, theorists cautioned that

“interest is not a type of motivation but rather an influence on motivation” (Schunk et

al., 2008, p. 237). In fact, intrinsic motivation refers to a fairly complex process that

involves many cognitive and affective factors. Nevertheless, the notion that intrinsic

motivation changes over time (Schunk et al., 2008) sheds light on research on interest

in school-age children. In particular, the overjustification hypothesis (Lepper, Greene,

& Nisbett, 1973), which attempts to account for diminishing intrinsic motivation in

children, treats interest as critical: offering learners rewards for academic tasks

conveys the message that their engagement does not reflect their own interests, and

such overjustification of their participation will undermine their intrinsic motivation as

a result. This phenomenon may also be explained by Skinner’s operant conditioning,

in which one’s intrinsic pleasure in doing specific tasks (i.e., antecedent stimuli) is


replaced by consequent stimuli: external rewards. The problem, then, is that

rewards are not always stable and controllable, and a learner may no longer be

motivated once the reward is removed.

The Roles of Interest and Affect in Achievement Motivation

In reviewing the roles of interest and affect in major theories of achievement

motivation, several issues emerged to guide this dissertation. Although interest and

affective factors are evident in some of the theories, they are conceptualized

differently in research models. In the expectancy-value theory, interest is among the

four components of task value, which operates as an immediate predictor of

motivation and is itself a function of affective memories (see Figure 2.3). The

attribution theory, on the other hand, seems to suggest a more direct connection

between affective outcomes (primarily emotions) and motivation (see Figure

2.4), but there is little mention of the role of interest. Likewise, research guided by

goal-orientation theories has shown that approach goals and avoidance goals may

produce different affective outcomes, but there have been few investigations of how

interest may influence the formation of goal-orientations. Lastly, the classification of

intrinsic and extrinsic motivations emphasizes the role of interest in activating and

sustaining intrinsic motivation. Despite the fact that some theories are missing interest

or affect in the hypothesized models, the positive influence of interest on motivation

and achievement is generally recognized. The role of interest, however, seems of less

importance than the cognitive factors in most of these models. Since Eccles and

Wigfield (1995) validated the factor structure of the 19-item measure of adolescents’

task values and expectancy-related beliefs, their subsequent studies of expectancy-value

theory have mostly relied on this measure, of which only two items are intended to

measure intrinsic interest.

With respect to the investigations of mathematics motivation and achievement

in particular, research has identified motivational factors including academic self-

concept, task value, and outcome expectations as predictors of achievement (e.g.,

Else-Quest, Mineo, & Higgins, 2013). Researchers, however, also proposed that this


line of research should do more to incorporate affective variables, given the findings

of their importance in academic motivation and achievement (Else-Quest et al., 2010;

Else-Quest, Hyde, & Hejmadi, 2008; Pekrun et al., 2006, 2009).

Gender, Ethnic, and Age Differences in Mathematics Interest and Affect

Gender Differences

The saying “boys will be boys,” whether representing a belief or reality,

appears to be evident in many spheres in our society. In mathematics education,

stereotypes that girls lack mathematical ability persist (Bhana, 2005). Such

stereotypes reflect the reality in the sense that some studies on mathematics

achievement did demonstrate the advantage of being male. For example, Tsui (2007)

found that the mean SAT-Math score among male high school seniors had been

consistently higher than that of their female counterparts, while Xie and Shauman

(2003) indicated that the ratio of males to females who scored in the top 5% in high

school mathematics had remained constant at 2 to 1 over the past 20 years.

Among early developmental theories, Freud’s (1961) psychoanalytic theory

addresses gender role development starting at the phallic stage (3-5 years).

Although this may have limited implications for today’s research and has received

much criticism from feminist theorists, it suggested an early recognition or awareness

of gender roles among children. In light of the social cognitive theory, Bussey and

Bandura (2004) suggested that parents, peers, teachers, the mass media, and various

social institutions together operate as the societal system, which contributes to

gender role development of children. The empirical evidence indicated that infants

and toddlers learn to differentiate their gender roles according to their associated

appearance and activities. For example, Bussey and Bandura (1992) found that

children 3 or 4 years old already become upset when they are given gender-typed

toys of the other gender. Boys of this age would either refuse to play with, for

example, dolls or other toys associated with feminine, nurturing features, or try to convert

them into masculine toys (e.g., guns). Given that the STEM professions have


been traditionally dominated by males in both reality and the mass media (NSF &

NCSES, 2010), it is not surprising that even a female preschooler would shy away

from STEM-related activities in an effort to confirm her feminine identity.

The discussion above is in line with a motivational perspective in explaining

gender differences in the STEM fields. If a gender gap in motivation accounts for a

large proportion of the variance in gender differences in achievement or enrollment

efforts, then the notion of a biological difference/intrinsic aptitude between gender

groups may be invalidated. In fact, there is little evidence endorsing gender

differences in intrinsic aptitude in the educational research (Rivers & Barnett, 2011),

and the notion of a biological difference has been criticized for not having a solid basis

(Spelke, 2005). As Rivers and Barnett (2011) stated, “what we do and how we do it

affect how our children’s brains begin to organize themselves and to process

information” (p. 6). This obviously aligns well with the social cognitive perspective

in motivation: the gender gap in mathematics may be explained by children’s

interactions with their environmental or societal influences. Some researchers (e.g.,

Guiso, Monte, Sapienza, & Zingales, 2008; Riegle-Crumb, 2005), for example, have

attempted to account for such a gender gap by exploring the gender inequities and

stereotypes prevalent in a given culture.

Although the bulk of research has indicated the advantage of being male in

STEM learning and professions, some recent trends are also worth noting. While a

gender gap in achievement still exists as reported by most studies in this field, this gap

has also been found to be slowly closing (Hyde, Lindberg, Linn, Ellis, & Williams,

2010). To further complicate this trend, some cross-cultural or international

comparison studies also indicate a non-uniform pattern of gender differences in the

STEM fields (Else-Quest et al., 2010), with girls demonstrating higher achievement or

motivation (e.g., attitude, self-concept) than boys in some countries or sub-cultures.

These trends may be explained by the constant changes in societal influences such as

family patterns, parental expectations, and level of gender equity.


In the past two decades, researchers have become more interested in

motivational factors in mathematics. In a Norwegian sample of middle to

high school students, Skaalvik and Skaalvik (2004) found that male students showed

higher self-concept, performance expectations, intrinsic motivation, and self-

enhancing ego orientations in mathematics than did female students. In terms of the

affective domain, the finding that females have higher anxiety about mathematics has been

empirically supported by many studies (e.g., Casey, Nuttall, & Pezaris, 1997; Hyde,

Fennema, Ryan, Frost, & Hopp, 1990; McGraw, Lubienski, & Strutchens, 2006).

Furthermore, Simpkins and Davis-Kean (2005) also suggested that girls are less likely

to aspire to careers that are related to mathematics, thus being less likely to take high

school mathematics courses. Given that motivational beliefs and achievement

behaviors are likely to be shaped through gender norms and roles (Jacobs & Simpkins,

2005), it is worth reviewing the literature on the developmental stage at which

gender norms and stereotypes start to influence children’s attitudes toward

mathematics.

For this issue, we may need to consult longitudinal studies in this area. Thus

far, much of the research on the gender gap has been devoted to identifying the attitudinal

and motivational factors at specific time points of schooling, but only a few studies

have employed longitudinal models to investigate gender differences in growth

over time. Among these studies, Leahey and Guo (2001) estimated curvilinear growth

models to examine gender differences in mathematics trajectories from elementary

through high school, and concluded that boys had a faster rate of acceleration despite

the relatively equal starting points and slopes. The findings seem to suggest that boys

and girls may demonstrate a gender gap in motivation at a very young age. In fact,

Eccles and her colleagues (Eccles et al., 1983, 1989) have conducted a number of

studies of individuals’ ability beliefs, expectancies for success, and subjective values,

and found that, as early as first grade, individuals begin to form clearly distinct

ability-expectancy beliefs and subjective values within the domains of mathematics,

reading, music, and sports.
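The distinction among starting point, slope, and acceleration in the growth-model findings discussed above can be made concrete with a generic quadratic growth equation. This is a textbook-style sketch only, not the specification estimated by Leahey and Guo (2001); all symbols are assumptions introduced here for illustration.

```latex
% Generic quadratic (curvilinear) growth model; notation assumed for illustration
% y_{ti}: mathematics score of student i at time t
\begin{align*}
y_{ti} &= \pi_{0i} + \pi_{1i}\, t + \pi_{2i}\, t^{2} + e_{ti} \\
\pi_{0i} &: \text{starting point (status at } t = 0\text{)} \\
\pi_{1i} &: \text{initial slope (rate of change at } t = 0\text{)} \\
\pi_{2i} &: \text{curvature; the acceleration of growth is } 2\pi_{2i}
\end{align*}
```

In this notation, relatively equal starting points and slopes combined with faster acceleration for boys would correspond to similar average values of the first two parameters across gender groups and a larger average curvature parameter for boys.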


Findings of group comparisons in this area may further complicate the

investigations. In Else-Quest et al.’s (2010) meta-analysis of cross-national

patterns of gender differences in mathematics, boys on average reported

more positive math attitudes and affect, but these differences showed large

variability among nations, indicating that girls in some nations (e.g., Thailand)

demonstrated higher motivation for mathematics than boys. Similarly, a comparison

of samples from Australia, Canada, and the United States by Watt et al. (2012)

showed that, while male adolescents held higher intrinsic value for mathematics than

females in the Australian sample, their male counterparts in both Canada and the

United States held higher ability/success expectancy, but not higher intrinsic value

than females. These findings indicate that gender differences in mathematics

motivation may also be a function of culture. Furthermore, gender differences may

also vary according to individual characteristics. For example, Ai (2002) used a large

national data set focusing on students’ development from Grade 7 through Grade 10,

and found that gender differences in mathematics trajectories vary by individuals’

initial status: a gender gap was found only in those who started low in mathematics,

with girls showing a higher initial status but a lower average growth rate than boys. In

conclusion, the literature indicated an early formation of children’s gender stereotypic

beliefs and interest in mathematics, as well as large individual variability due to

demographic characteristics (e.g., culture, starting point).

Ethnic Differences

The underachievement of ethnic minority students has been a critical issue in

our education. Many studies have reported that Hispanic and African American

students scored significantly lower than their White and Asian counterparts on

standardized tests (e.g., Bainbridge & Lasley, 2002; Kao & Thompson, 2003; Riegle-Crumb

& Grodsky, 2010), and that such an achievement gap emerges in their early

school years (Lee & Burkam, 2003). Similar to how gender stereotypes affect female

students’ interest and affect for mathematics, theorists suggested that minority

students may also be subject to such a threat, whereby negative academic stereotypes


regarding certain ethnic groups influence their motivation and performance (Ogbu,

2003; Steele & Aronson, 1995). In line with this stereotype threat hypothesis,

research has shown that such ethnic differences in mathematics achievement may be

largely attributed to motivational factors. For example, in examining the predictors of

mathematics achievement among White, African American, and Hispanic students,

Byrnes (2003) found that ethnicity accounted for less than 5% of the variance in

achievement when socioeconomic status, exposure to learning resources, and

motivation were controlled.

The motivational perspective is not new in the literature. Many early studies

took this approach to investigating ethnic differences in mathematics-related self-

concept, ability beliefs, attitude, interest, and affect along with gender differences in

these variables. Nonetheless, these early studies were considered to be limited by data

available at the time, as well as the confounding effect of gender (Oakes, 1990). The

expectancy-value model (Eccles et al., 1983), which many of these investigations were

based on, is also viewed as inadequate for tackling issues of ethnic inequalities

(Riegle-Crumb, Moore, & Ramos-Wada, 2011). As such, researchers concerned about

ethnic inequalities have advocated research that investigates “the path to math” for

different subgroups of men and women, namely the intersection of ethnicity and

gender. For instance, using data from a large, national survey to examine ethnic-

specific gender differences in mathematics motivation, Catsambis (1994) found that,

while all female students tended to demonstrate less interest in mathematics, gender

differences were the largest among Hispanics and smallest among African Americans.

In the Byrnes (2003) study, which indicated that ethnic differences in mathematics

achievement were largely accounted for by socioeconomic status, learning resources,

and motivation, however, no ethnic difference was revealed in ability-liking (e.g., “I

like math.”), the interest-related motivational factor in this study.

Further studies continued to yield mixed findings regarding the gender-

ethnicity interaction for explaining the variance in students’ mathematics interest and

affect. Riegle-Crumb et al. (2011), for example, indicated that African American and


Hispanic females were doubly disadvantaged because their attitudes were below those

of White males and males of their own ethnic groups. In terms of enjoyment, females

from all ethnicities showed less mathematics enjoyment than White males, although

only White females reported significantly less enjoyment than their

White male counterparts. Indexing cognitive and emotional engagement in

mathematics with students’ perceived levels of challenge, Martinez and Guzman

(2013) found that while boys across ethnicities reported similar levels of engagement

in mathematics classes, no singular story applied to girls of different backgrounds.

Specifically, Black girls demonstrated lower levels of engagement in mathematics as

compared to girls of other ethnicities, and Asian girls reported the highest, though not

significantly higher, levels of engagement in this domain. Like Byrnes (2003), however, the

Else-Quest et al. (2013) study on mathematics attitudes again disaffirmed the gender-ethnicity

interaction. Their findings indicated a pattern of ethnic similarities in

mathematics attitudes only with the exception of greater mathematics values among

African American students.

The mixed findings of ethnic differences in mathematics clearly relate to the

varied concepts and operationalizations of affective variables in this domain of

research. Given the limited research explicitly dealing with the concept of interest and

the broadness of the concept of affect, this part of the literature review involved a

variety of affective variables such as liking (Byrnes, 2003), attitude (Catsambis, 1994;

Else-Quest et al., 2013; Riegle-Crumb et al., 2011), and emotional engagement

(Martinez & Guzman, 2013), and these variables were all operationalized in different

ways. For legitimate investigations of ethnic differences along with their intersection

with gender, more measures with adequate validity need to be made available to

researchers. As a prerequisite, the psychometric properties of these measures should

be evaluated carefully with advanced techniques.

Age Differences

Although some influential researchers stated that interest has a critical role in

the learning and development of students (Krapp et al., 1992; Renninger, 1992),


research is fairly limited on developmental differences in interest (Schunk et al., 2008).

In general, interest is believed to play a more important role in guiding the behaviors

of younger than older children and adults because older children and adults often have

to engage in mandatory tasks that are not of much interest to them (Hidi & Anderson,

1992; Krapp et al., 1992). Unfortunately, research has shown that students’ interest in

academic tasks declines with age, and that interest in mathematics and science tends to

drop the most (Eccles et al., 1998; Harter, 1981; Kahle, Parker, Rennie, & Riley, 1993;

Riegle-Crumb et al., 2011; Tracey, 2002; Tracey & Ward, 1998; Wigfield, 1994;

Wigfield & Eccles, 1992). Overall, these studies indicated that such declines occur as

early as third grade and extend all the way through 12th grade.

Several perspectives were provided for explaining the decline of interest

during school years. Many believe that these changes may be partly due to changes in

the students’ ability perceptions (Barak, 1981; Dweck & Elliott, 1983; Super, 1990;

Wigfield & Eccles, 1992). As individuals grow older, many of them develop entity

beliefs and tend to think that their intelligence or ability in a specific domain is fixed

and cannot be improved anymore. As a result, students who perform relatively poorly

in mathematics tend to devalue this subject area as a defense mechanism to protect

their overall self-esteem. Another factor concerns the school environment, which

tends to become more bureaucratic and controlling as students advance from

elementary to junior high (Eccles & Midgley, 1989). Related to the school

environment factor, the constraints of curricula that do not include choices may also

be responsible for the decline in students’ interest over time (Hoffmann, 2002).

Among the limited research investigating the developmental differences in

interest, Tracey and Ward’s (1998) study on the structure of children’s interest and

competence perceptions is of particular relevance to this dissertation. In evaluating

the Inventory of Children’s Activities (ICA) that was developed to assess interest

according to Holland’s (1985) RIASEC model—Realistic, Investigative, Artistic,

Social, Enterprising, and Conventional—the researchers identified the lack of

similarities of interest and competence structure across the age groups. In particular,


the RIASEC circular structure was evident in the college sample, but not in the younger

middle school and elementary school samples. As the authors

argued, the failure of the circular model to fit the younger groups indicates that the

structure of interest varies with age. Children appeared to respond to the ICA items

using different dimensions than did college students. Whereas college students used

the People/Things and Data/Ideas dimensions in responding, elementary students used

sex typing and locus of activity instead. Middle school students, on the other hand,

seemed to use the dimensions that were related to both college and elementary school

students. Clearly, this measurement study adds to the perspectives discussed earlier

that, in addition to changes in ability beliefs, school environment, and autonomy over

curricula, developmental differences in interest may also be a function of how

individuals’ cognitive interpretation of interest develops over time.

Summary of the Literature

Despite the critical role of interest in enhancing learning and performance

(Schiefele et al., 1992), educational research on the topic of interest has not made

significant progress since its renaissance in the early 1990s. The varied

conceptualizations and discrete research approaches (Krapp et al., 1992) may still

prevent interest research from flourishing. That being said, the current trend in

educational research, as guided by the social cognitive perspective, represents a

promising direction for ongoing and future endeavors in this field. Particularly, there

is a need for more research to address the interplay among the three approaches (i.e.,

personal interest, situational interest, and interestingness) to interest research.

As discussed earlier, there is some conceptual fuzziness surrounding the

constructs of interest and affect, which makes it difficult to draw a clear distinction

between interest and affect. From a review of extant literature, interest may be

equated with an affective variable (e.g., attitude, emotion), indicated by certain

affective outcomes (e.g., liking, positive emotions), or viewed as a function of certain

affect (e.g., affective memories). Some cognitive variables also appear to relate to

interest in motivational theories, among which self-perceived competence or self-


efficacy are closely tied to the development of interest (Bandura, 1977; Super, 1990).

There is little doubt that future research would benefit from a more unified and singular

conceptualization of interest, but considerable work is still required to address the

conceptual fuzziness plaguing interest research. At present, it is simply unwise for

researchers to exclude these closely related variables in their investigations.

Gender, ethnic, and developmental differences in interest are well-documented

in the literature, and these differences are worth careful investigation, given the

underachievement and under-representation of some subpopulations in certain

academic domains. Research on mathematics interest is particularly warranted to

provide implications for enhancing female and minority students’ achievement

motivation and career aspirations in the STEM fields. This line of research, however,

has merely shown us the tip of the iceberg. With respect to gender differences in

mathematics interest, findings appear to be more consistent in that males, particularly

White males, tend to show higher levels of interest or positive affect (Casey et al.,

1997; Catsambis, 1994; Hyde et al., 1990; McGraw et al., 2006; Riegle-Crumb et al.,

2011; Simpkins & Davis-Kean, 2005; Skaalvik & Skaalvik, 2004). When ethnicity is

added to such investigations, findings are mixed: some studies report a gender-ethnicity

interaction in predicting mathematics interest (Catsambis, 1994; Martinez & Guzman,

2013; Riegle-Crumb et al., 2011), whereas others report no interaction at all (Byrnes, 2003; Else-Quest et al.,

2013; Tracey & Ward, 1998). Even in the studies where this interaction is

present, no clear pattern has emerged as to, for example, which subpopulation (e.g.,

Black females) is the most advantaged or disadvantaged. Certainly, the developmental

differences in interest further complicate this topic, directing our attention to the

measurement issues of existing instruments of mathematics interest (e.g., Tracey &

Ward, 1998).

The measurement of interest suffers from the fact that interest is both a

psychological term and a common-language term. This metatheoretical situation of interest

(Valsiner, 1992) calls for the application of advanced techniques in examining the

item-person interaction because, in comparison to the measurement of some cognitive


variables (e.g., self-efficacy), we are even less certain that an interest item is

interpreted and responded to in the way it is intended. Adding to the significance of

this psychometric evaluation are the gender, ethnic, and developmental differences in

mathematics interest. A key task of this evaluation was to detect whether the

differences among various subpopulations are a function of measurement bias rather

than the true differences in the latent trait.

Research Questions and Hypotheses

The following research questions and corresponding hypotheses were derived from

a review of the extant literature:

1. From both CTT and IRT frameworks, does the SDQI accurately measure

interest and affect in mathematics?

Hypothesis: The SDQI will have sufficient psychometric properties for the

sample as a whole according to the CTT and IRT frameworks in measuring mathematics

interest and affect.


across gender?

Hypothesis: Gender differences will emerge at the item-level as revealed by

differential item functioning analyses in an IRT framework.


across ethnic groups?

Hypothesis: Ethnic group differences will emerge at the item-level as revealed

by differential item functioning analyses in an IRT framework.


across age groups?

Hypothesis: Age group differences will emerge at the item-level as

revealed by differential item functioning analyses in an IRT framework.


5. How do children’s responses to the mathematics interest and affect items

change over time?

Hypothesis: Item-level responses will change over time as a whole and as a

function of group membership according to gender and ethnic group.


CHAPTER III

METHOD

Description of the Data

The data for this dissertation were drawn from the Early Childhood

Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K; Tourangeau, Nord, Lê,

Sorongon, & Najarian, 2009), a national longitudinal data set. The ECLS-K focuses

on children’s early school experiences beginning in kindergarten as a multisource,

multimethod study including interviews with parents, data collected from school

principals and teachers, student record abstracts, and direct child assessment.

Beginning in the fall of 1998, the ECLS-K has been collecting data on a nationally

representative cohort of children from kindergarten into middle school. A total of

21,260 children throughout the nation participated. By the time of the present study,

the ECLS-K had completed seven waves of collections: kindergarten-fall (Wave 1),

kindergarten-spring (Wave 2), first grade-fall (Wave 3), first grade-spring (Wave 4),

third grade-spring (Wave 5), fifth grade-spring (Wave 6), and eighth grade-spring

(Wave 7).

The design of the ECLS-K was guided by a framework of children’s

development and schooling that emphasizes the interrelationships among the child, the

family, the school, and the community. Along with data collection from

parents/guardians, teachers, and school administrators, direct child assessment is a

critical study component of the ECLS-K, in which children were asked to participate

in activities designed to measure important cognitive (i.e., literacy, quantitative, and

science) and non-cognitive (i.e., fine motor and gross motor coordination and

socioemotional) skills and knowledge. Beginning with the third-grade collection

(Wave 5), children were also asked to report on their self-perceptions of abilities and

achievements, peer relationships, and problem behaviors. Specifically, children

attending third grade in the spring of 2001-02 were asked to complete a short self-

description questionnaire, part of which was the adapted Self Description

Questionnaire I (SDQI), on how they thought and felt about themselves both


academically and socially. The same questionnaire was administered in Wave 6, when

these children were attending fifth grade in the spring of 2003-04 (Tourangeau et al.,

2009).

Sampling Frame

The ECLS-K employed a multistage probability sample design to form a

nationally representative sample of children attending kindergarten in 1998-99, with

the primary sampling units (PSUs) being geographic areas, the second-stage units

being schools within PSUs, and the third- and final-stage units being

children within schools. In the base year (Wave 1), the basic PSU measure of size was

the number of 5-year-olds with modification to facilitate the oversampling of Asian

Pacific Islanders. One hundred PSUs were selected for the ECLS-K, among which 24

PSUs with the largest measure of size were designated self-representing (SR) and

included in the sample with certainty while the remaining PSUs were partitioned into

38 strata of roughly equal size. Next, private and public schools offering kindergarten

programs were selected as the second-stage units, and the third-stage sampling units

were children of kindergarten age, selected within each sampled school (Tourangeau

et al., 2009).

The sample for this dissertation is limited to children assessed with the SDQI in third and

fifth grades (Waves 5 and 6), drawn from the “Early Childhood

Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K) kindergarten through

fifth grade Approaches to Learning and Self Description Questionnaire (SDQ) items

and public-use data files,” particularly the SDQ item level data (U.S. Department of

Education, National Center for Education Statistics, 2010). Hence, it is necessary to

discuss in detail the sampling procedures of these two waves. The sample of children

for spring-third grade (Wave 5) consists of all children who were base-year

respondents, and children who were brought into the sample in spring-first grade

(Wave 4) through the sample freshening procedure. While an effort was made to

contact all children enrolled in the base-year schools, slightly over 50% of the base-

year sampled children who had transferred from their kindergarten school were


followed for data collection. The subsample of children was the same 50% subsample

of base-year movers flagged for following in spring-first grade, with the addition of

movers whose home language was not English. Following Wave 5, the fifth grade

data collection (Wave 6) excluded 5,214 of the 21,357 children eligible for the study

after the base year for reasons such as mortality or parental refusal to cooperate. The

remaining children were subsampled at different rates depending on the longitudinal

data available (Tourangeau et al., 2009).

Sample

While a nationally representative sample of 21,409 children is available for

analysis from the ECLS-K “Kindergarten through fifth grade Approaches to Learning

and Self Description Questionnaire (SDQ) item wording and item-level data,” 6,778

of them were removed from this study because they were missing data throughout all

the SDQI measures in Waves 5 and 6. Among the remaining 14,631 participants, 50.9%

(n = 7,441) reported being male, and 49.1% (n = 7,190) reported being female. In

terms of race, approximately 56.2% (n = 8,226) reported being “White (non-

Hispanic),” followed by 13.1% (n = 1,918) being “Black or African American,” 9.4%

(n = 1,371) being “Hispanic (race not specified),” 8.9% (n = 1,298) being “Hispanic

(race specified),” 6.7% (n = 978) being “Asian,” 2.6% (n = 387) endorsing more than

one race (non-Hispanic), 1.8% (n = 260) being “American Indian or Alaska Native,”

1.2% (n = 174) being “Native Hawaiian or Other Pacific Islander,” and 0.1% (n = 19)

with race data missing. For all ethnicity-related analyses in the current study,

“Hispanic (race not specified)” and “Hispanic (race specified)” were combined as

“Hispanic,” and multiple races, “American Indian or Alaska Native,” and “Native

Hawaiian or Other Pacific Islander” were combined as “Other Ethnicities.”
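
The category counts and the collapsing rule described above can be checked with a short tally. The sketch below is illustrative only: the counts are those reported in this section, while the dictionary names and the `collapse_counts` helper are hypothetical and are not part of the ECLS-K files or of the analyses reported here.

```python
# Counts reported in the Sample section for the analytic sample.
race_counts = {
    "White (non-Hispanic)": 8226,
    "Black or African American": 1918,
    "Hispanic (race not specified)": 1371,
    "Hispanic (race specified)": 1298,
    "Asian": 978,
    "More than one race (non-Hispanic)": 387,
    "American Indian or Alaska Native": 260,
    "Native Hawaiian or Other Pacific Islander": 174,
    "Missing": 19,
}

# Hypothetical recoding map that mirrors the collapsing rule in the text.
collapse = {
    "Hispanic (race not specified)": "Hispanic",
    "Hispanic (race specified)": "Hispanic",
    "More than one race (non-Hispanic)": "Other Ethnicities",
    "American Indian or Alaska Native": "Other Ethnicities",
    "Native Hawaiian or Other Pacific Islander": "Other Ethnicities",
}


def collapse_counts(counts, mapping):
    """Sum counts over the collapsed category labels."""
    out = {}
    for label, n in counts.items():
        key = mapping.get(label, label)  # unmapped labels keep their own name
        out[key] = out.get(key, 0) + n
    return out


collapsed = collapse_counts(race_counts, collapse)
total = sum(race_counts.values())
for label, n in collapsed.items():
    print(f"{label}: n = {n} ({100 * n / total:.1f}%)")
```

The tally gives the group sizes that enter the ethnicity-related comparisons after collapsing.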

Instrumentation

All measures were obtained from the “ECLS-K kindergarten through fifth

grade Approaches to Learning and Self Description Questionnaire (SDQ) items and

public-use data files” (U.S. Department of Education, National Center for Education

Statistics, 2010). In both third and fifth grades, the ECLS-K children were asked to


complete the adapted Self Description Questionnaire I (SDQI), which was designed to

measure children’s self-perceptions of their school abilities and interests, peer

relationships, and problem behaviors. The adapted SDQI is a 42-item measure, in

which children self-report their perceived competence and interest in four non-

academic domains (i.e., physical abilities, physical appearance, relations with peers,

and relations with parents) and three academic domains (i.e., reading, mathematics,

and all school subjects). Table 3.1 presents the items of the mathematics subscale. As

shown in the table, there are eight items with four ordered response options:

1 (not at all true), 2 (a little bit true), 3 (mostly true), and 4 (very true).

Table 3.1

The ECLS-K Adapted SDQI Mathematics Subscale

Item     Statement
SDQ6     Work in math is easy for me.
SDQ12    I cannot wait to do math each day.
SDQ16    I get good grades in math.
SDQ22    I am interested in math.
SDQ26    I can do very difficult problems in math.
SDQ30    I like math.
SDQ36    I enjoy doing work in math.
SDQ41    I am good at math.

Face Validity

The Self Description Questionnaire I (SDQI) was initially designed to measure

seven components of self-concept (i.e., physical abilities, physical appearance,

relations with peers, relations with parents, reading, mathematics, and all school

subjects) based on Shavelson’s hierarchical model (Shavelson & Bolus, 1981;

Shavelson, Hubner, & Stanton, 1976). The three academic subscales (i.e., reading,

mathematics, and all school subjects) are each measured by 10 parallel items including

five cognitive statements (e.g., “I get good marks in mathematics”) and five affective

statements (e.g., “I am interested in mathematics”). Children are asked to respond to

each item on a 5-point scale including false, mostly false, sometimes false sometimes


true, mostly true, and true. The factor structure of the original SDQI was validated by

its test developers (Marsh, Relich, & Smith, 1983; Marsh, Smith, & Barnes, 1983).

Using two independent samples of fifth and sixth graders, Marsh, Relich, et al. (1983)

conducted factor analysis and identified the seven dimensions that the SDQI was

intended to measure, and an additional factor that was defined by affective items from

all three academic scales. In other words, the five affective items of each academic

domain tended to load together on an eighth factor in addition to loading on their own

domains respectively.

Comparatively, limited psychometric examination has been conducted on the

ECLS-K adapted version of the SDQI. The mathematics subscale (see Table 3.1), in

particular, was adapted from the original version to include only positively worded

statements with slight adaptations in wording (e.g., “I get good grades in math”

adapted from “I get good marks in mathematics”). The adapted subscale still consists

of an equal number of cognitive statements (i.e., SDQ6, 16, 26, and 41) and affective

statements (SDQ12, 22, 30, and 36). In addition, the response scale was changed from

5-point to 4-point with anchors adapted as well. A reasonable a priori consideration of

this study was, therefore, whether the eight items reflect a single dimension of

measurement, leading to the examination of the face validity of items, as well as the

use of factor analysis to explore and confirm the factor structure of the SDQI

mathematics subscale (discussed in detail in Chapter 4).

Face validity, sometimes termed content validity, refers to the degree to

which a test appears to measure what it claims to measure (Gay et al., 2008). Because

the SDQI was founded on the model of academic self-concept, it is worth discussing

how self-concept relates to other cognitive and affective constructs, particularly

academic interest. Shavelson et al.’s (1976) definition, which formed the theoretical

foundation of contemporary self-concept research, suggested that self-concept, in very

broad terms, is a person’s perception of himself, which may be described as “organized,

multifaceted, hierarchical, stable, developmental, evaluative, and differentiable” (p.

411). Different views exist as to whether self-concept includes emotional reactions


such as interest, enjoyment, and satisfaction. While the development of the SDQI

clearly regarded these as a part of self-concept, other researchers (e.g., Eccles &

Wigfield, 1995) have made clear distinctions between ability- or expectancy-related

perceptions and task-value components (Bong & Skaalvik, 2003). Although perceived

competence plays a central role in academic self-concept, some researchers also

argued that self-concept consists of several distinguishable aspects, one of which

is affective in nature (e.g., Bong & Clark, 1999; Scheirer & Kraut, 1979; Skaalvik,

1997). In fact, some studies on the SDQ did indicate that the cognitive dimension and

the affective dimension tend to form separate factors (Skaalvik & Rankin, 1996;

Tanzer, 1996).

Nonetheless, the SDQI mathematics items, particularly the affective statements,

all appear to either assess interest directly (SDQ22—“I am interested in math”) or

reflect certain aspects of interest that are well documented in the literature

(discussed in detail in Chapter 2) such as motivation (SDQ12—“I cannot wait to do

math each day”), liking (SDQ30—“I like math”), and positive emotion (SDQ36—“I

enjoy doing work in math”). Furthermore, even the cognitive aspect as perceived

competence was shown to have a reciprocal relationship with academic interest

(Marsh et al., 2005). Wigfield and Eccles (2002) particularly emphasized the role of

perceived competence in developing situational interest into personal interest in school

settings. In summary, the SDQI mathematics items appear to have sufficient face

validity for examining mathematics interest and affect, though this argument was to be

strengthened by results from the factor analysis.

Item Response Theory

In the creation and evaluation of psychological assessments, researchers have

traditionally turned to psychometric techniques that provide guidelines to determine

the value of an instrument (Kline, 2000). The traditional models and procedures such

as the classical test theory (CTT) model are based on weak assumptions that can be

met easily by most data sets, but there are also some well-documented shortcomings

associated with these techniques. First, CTT is sample-dependent, meaning that the


reliability estimates depend on the sample from which they are derived. Researchers

would, therefore, expect different item characteristics when the measure is used in

different samples. Second, CTT does not take into account the difficulty level or

threshold of responses of a measure for some individuals. If an item appears too

difficult for specific individuals to answer correctly or has too high a threshold for specific

individuals to endorse a higher-order response, then the measure will likely be unable

to yield reliable estimates of their true scores on the construct being measured. Third,

CTT assumes that the amount of measurement error will be the same across all items

in a scale, but this assumption is often violated, given the varying item-level

characteristics. Finally, the reliability estimates often require correlating individuals’

responses on an alternative, strictly parallel test, which is often a challenge to test

developers. In addition to these shortcomings, CTT has also failed to provide

satisfactory solutions to many testing problems such as the design of tests and the

identification of biased items (Hambleton & Swaminathan, 1985). Because of this,

considerable attention has been directed at item response theory (IRT) over the past

three decades.

An IRT model specifies a relationship between the observable response and the

unobservable traits assumed to underlie the respondent’s performance on the test, and

this relationship is defined by a mathematical function. Therefore, IRT is in fact a

system of mathematical models that defines the relationship between the latent traits

and their manifestations (de Ayala, 2009). There are three primary advantages of IRT

models over CTT models. First, assuming a large pool of items all measuring the

same trait, the estimate of a respondent’s latent trait is independent of the particular

sample of test items that are administered to the respondent. Second, assuming a large

population of respondents, the estimates of item characteristics are independent of the

particular sample of respondents. Third, a statistic indicating the precision with which

each respondent’s latent trait is estimated is provided, and this statistic is free to vary

from one respondent to another. An additional desirable feature is that the concept of

parallel forms reliability commonly seen in CTT is replaced by the concept of


statistical estimation and associated standard errors. The feature of item parameter

invariance, though not unique to IRT, can also be well achieved when the chosen

model fits the data (Hambleton & Swaminathan, 1985).

Assumptions

Being mathematical in nature, IRT models rest on a set of assumptions

including unidimensionality, local independence, and functional form (de Ayala, 2009;

Hambleton & Swaminathan, 1985). The unidimensionality assumption requires that

the observations on the test items are solely a function of a single continuous latent

trait. In the current study, this means that the SDQI items need to reflect the latent trait

of mathematics interest as a unidimensional rather than multidimensional construct.

This assumption, however, may be viewed as analogous to the homogeneity of

variance assumption in that some degree of violation may or may not be problematic.

That is, a unidimensional IRT model may provide a sufficiently accurate estimate to

be useful even though the data may in fact reflect two latent traits (de Ayala, 2009).

It is of particular relevance to the current study that the IRT model may still be robust

for estimation because the unidimensionality of the SDQI mathematics subscale,

though well supported by previous validation studies, had yet to be confirmed by

factor analysis in the current study.

A second assumption, local independence, means that the responses to an item

are independent of the responses to any other item conditional on the person’s location.

In other words, how a person responds to an item needs to be solely a function of his

relative standing on the latent continuum, and not of how he responds to

any other items of the measure (de Ayala, 2009). Violations of this assumption occur

more commonly in cognitive tests where, for example, there is a series of questions

that all relate to the same passage. Considering the affective nature of the SDQI items,

violations of local independence are unlikely to occur. A third assumption, the functional

form assumption, requires the data to follow the function specified by the model. This

assumption is rarely exactly met in practice, and the degree of violation can be

assessed by IRT model fit indices (de Ayala, 2009).
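
The second assumption, that item responses are independent once the person's location is taken into account, is commonly written as a factorization of the joint response probability. The expression below is a standard textbook statement rather than a formula taken from this study; the symbols x_i (the observed response to item i) and n (the number of items, eight for the SDQI mathematics subscale) are notation assumed here for illustration.

\[
P(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta)
\]

Under this factorization, any association remaining among the items after conditioning on θ counts as a violation of the assumption.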


Graded Response Model

IRT is a system of mathematical models that defines the relationship between

the latent traits and observed indicators. Commonly used IRT models include the

Rasch model, the one-, two-, and three-parameter logistic models for dichotomous

items, and the graded response model and partial credit model for multi-category

scoring (Hambleton & Swaminathan, 1985). Given the ordered response format of the

SDQI items, Samejima’s (1969) two-parameter (2PL) graded response model (GRM)

was employed for the IRT estimates. In the GRM, option characteristic curves

(OCCs) are estimated for each response option in an item. The mathematical formula

for an item with K = 4 ordered response options (k = 0, 1, 2, 3) may be written as:

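\[
P_{k,i}(\theta) =
\begin{cases}
1 - \dfrac{e^{a_i(\theta - b_{1,i})}}{1 + e^{a_i(\theta - b_{1,i})}}, & k = 0 \\[2ex]
\dfrac{e^{a_i(\theta - b_{1,i})}}{1 + e^{a_i(\theta - b_{1,i})}} - \dfrac{e^{a_i(\theta - b_{2,i})}}{1 + e^{a_i(\theta - b_{2,i})}}, & k = 1 \\[2ex]
\dfrac{e^{a_i(\theta - b_{2,i})}}{1 + e^{a_i(\theta - b_{2,i})}} - \dfrac{e^{a_i(\theta - b_{3,i})}}{1 + e^{a_i(\theta - b_{3,i})}}, & k = 2 \\[2ex]
\dfrac{e^{a_i(\theta - b_{3,i})}}{1 + e^{a_i(\theta - b_{3,i})}}, & k = 3
\end{cases}
\tag{1}
\]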

where Pk,i is the probability of a person’s endorsement of response option k for item i,

θ (theta) is the latent trait, ai is the discrimination parameter for item i, the

mathematical constant e is the base of the natural logarithm, and bk,i is the threshold

parameter (de Ayala, 2009). Figure 3.1 presents a graphical illustration of the OCCs

for a hypothetical item with four ordered options with a = 1.7, b1 = -1.5, b2 = 0, and b3

= 1.5.


Figure 3.1. Option characteristic curves (OCCs) for a hypothetical item with four

response options (a = 1.7, b1 = -1.5, b2 = 0, b3 = 1.5).

In the 2PL GRM, the discrimination parameter (a) indicates the degree of

slope at each point of inflection. According to Baker (1985, 2001), item

discrimination is considered to be very low for a ≤ .34, low for .35 ≤ a ≤ .64, moderate

for .65 ≤ a ≤ 1.34, high for 1.35 ≤ a ≤ 1.69, and very high for a ≥ 1.70. The threshold

parameters b1,i and b3,i represent the values of θ where the probability is 0.5 for

endorsing the lowest (k = 0) and highest options (k = 3), respectively, and the modes

of the other OCCs (k = 1, 2) are specified as (b1,i + b2,i)/2 and (b2,i + b3,i)/2,

respectively. More generally, each threshold bk,i marks the value of θ at which the

probability of endorsing option k or higher equals that of endorsing a lower option (de

Ayala, 2009). As shown in Figure 3.1, the probability is 0.5 for an individual to

endorse Option 0 (not at all true) when his interest in mathematics is 1.5 standard

deviations below the mean, and for Option 3 (very true) when 1.5 standard deviations

above the mean. For OCCs of Options 1 (a little bit true) and 2 (mostly true), the θ

values are (-1.5 + 0)/2 = -0.75 where the mode of OCC for Option 1 is present and (0

+ 1.5)/2 = 0.75 for Option 2.
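
The worked values above can be reproduced numerically. The following is a minimal sketch of Equation 1 for the hypothetical item in Figure 3.1; the function names `boundary` and `grm_probs` are illustrative choices made here and do not correspond to the software used in this study.

```python
import math


def boundary(theta, a, b):
    """Probability of endorsing a given option or higher (logistic boundary curve)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def grm_probs(theta, a, thresholds):
    """Option probabilities P_0 ... P_m under the graded response model."""
    # Pad the boundary curves with 1 (option 0 or higher) and 0 (above the top option).
    cum = [1.0] + [boundary(theta, a, b) for b in thresholds] + [0.0]
    # Each option probability is the difference between adjacent boundary curves.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


a, b = 1.7, [-1.5, 0.0, 1.5]  # hypothetical item from Figure 3.1
for theta in (-1.5, -0.75, 0.75, 1.5):
    probs = grm_probs(theta, a, b)
    print(theta, [round(p, 3) for p in probs])
```

Evaluating the function over a fine grid of θ values would trace the four OCCs plotted in Figure 3.1.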


In addition to the OCCs, the item information function (IIF), a function of θ, also

provides valuable insight into the precision of measurement provided by each

specific item in a test. Samejima (1974) defined information for polytomous IRT

models as:

\[
I_i(\theta) = \sum_{k=0}^{m_i} I_{k,i}(\theta) = \sum_{k=0}^{m_i} \frac{\left[P'_{k,i}(\theta)\right]^2}{P_{k,i}(\theta)}
\tag{2}
\]

where Ii is the item information for item i, mi is equal to the number of score points

minus one, Ik,i is item option information of response option k which potentially

contributes to the item information, Pk,i is the same as in Equation 1, and Pk,i’, given K

= 4, is defined as:

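\[
P'_{k,i}(\theta) =
\begin{cases}
-a_i \left[ \dfrac{e^{a_i(\theta - b_{1,i})}}{\left(1 + e^{a_i(\theta - b_{1,i})}\right)^2} \right], & k = 0 \\[2ex]
a_i \left[ \dfrac{e^{a_i(\theta - b_{1,i})}}{\left(1 + e^{a_i(\theta - b_{1,i})}\right)^2} - \dfrac{e^{a_i(\theta - b_{2,i})}}{\left(1 + e^{a_i(\theta - b_{2,i})}\right)^2} \right], & k = 1 \\[2ex]
a_i \left[ \dfrac{e^{a_i(\theta - b_{2,i})}}{\left(1 + e^{a_i(\theta - b_{2,i})}\right)^2} - \dfrac{e^{a_i(\theta - b_{3,i})}}{\left(1 + e^{a_i(\theta - b_{3,i})}\right)^2} \right], & k = 2 \\[2ex]
a_i \left[ \dfrac{e^{a_i(\theta - b_{3,i})}}{\left(1 + e^{a_i(\theta - b_{3,i})}\right)^2} \right], & k = 3
\end{cases}
\tag{3}
\]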

where ai and bk,i represent the same parameters as in Equation 1. As such, IIF curves

may also be graphically illustrated as shown in Figure 3.2. Furthermore, the IIFs can

be summed to create the test information function, and the amount of information is

influenced by the quality and number of test items. The contribution of each item to

the total information is additive and does not depend on the

other items in the set (Hambleton & Swaminathan, 1985).


Figure 3.2. Item information function (IIF) curve for a hypothetical item with four

response options (a = 1.7, b1 = -1.5, b2 = 0, b3 = 1.5).

Differential Item Functioning

Invariance is a key component of accurate measurement because the examination

of any group difference is meaningless if the measurement tool is of poor quality.

Such measurement bias may emerge out of many situations in research. For instance,

individuals from different cultures may interpret and respond to an item differently,

and individual differences (e.g., gender, age, personality) may trigger the use of

different frames of reference in the responding process as well (Vandenberg, 2002).

Differential item functioning (DIF), also referred to as measurement bias, occurs when

respondents of subgroups with the same latent trait have different probabilities of

endorsing a test item (Holland & Wainer, 1993). DIF may exist in a variety of tests

including attitude and personality tests, as well as cognitive tests when they contain

items that inadvertently assume that the respondents have certain knowledge or a

particular background in order to understand the items as intended. In the

context of the current study, this could mean that, for instance, a boy and a girl with

the same mathematics interest may respond to the corresponding item differently as a


function of gender. Likewise, age neutrality may also be examined for measures of

interest because research suggests that, as children age and their sense of

metacognition develops, different interpretations may trigger DIF concerning how

they respond to the measures of interest.

From the IRT perspective, the existence of DIF means that the item’s

parameter estimates (e.g., item discrimination, item threshold) are not invariant across

the manifest groups, that is, item-data misfit. As such, there are two forms of DIF in

the IRT framework: uniform DIF, which indicates that one group performs better than the

other group throughout the continuum, and nonuniform DIF, which indicates that one

group performs better than the other group only for a particular portion of the

continuum. Graphically, uniform DIF is represented by parallel OCCs between two

groups, while the OCCs tend to intersect for nonuniform DIF (de Ayala, 2009).
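
The distinction between the two forms of DIF can be illustrated with a pair of curves. The sketch below is a simplification: it compares a single logistic boundary curve per group rather than the full set of OCCs, and the (a, b) values and the `dif_pattern` helper are hypothetical. It labels the gap between groups as uniform when it keeps one sign across the grid and as nonuniform when the sign changes, that is, when the curves cross.

```python
import math


def boundary(theta, a, b):
    """Logistic boundary curve: probability of endorsing a given option or higher."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def dif_pattern(ref, focal, grid, tol=1e-12):
    """Classify the gap between two groups' curves as 'none', 'uniform', or 'nonuniform'."""
    gaps = [boundary(t, *ref) - boundary(t, *focal) for t in grid]
    # Keep only the sign of gaps that are not numerically zero.
    signs = {1 if g > 0 else -1 for g in gaps if abs(g) > tol}
    if not signs:
        return "none"
    return "uniform" if len(signs) == 1 else "nonuniform"


grid = [-4 + 0.1 * i for i in range(81)]
# Hypothetical (a, b) pairs for a reference group and a focal group.
print(dif_pattern((1.7, 0.0), (1.7, 0.5), grid))  # equal slopes, shifted threshold
print(dif_pattern((1.7, 0.0), (0.9, 0.0), grid))  # equal thresholds, unequal slopes
```

In this simplified setting a difference in the threshold alone yields curves that never cross, whereas a difference in discrimination makes them intersect.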

General DIF methods based on IRT include the likelihood ratio (TSW-ΔG2;

Thissen, Steinberg, & Wainer, 1988, 1993), Lord’s Wald test (Lord, 1980), and the

Exact Signed Area and H Statistic (Raju, 1988, 1990), while non-IRT-based

approaches to DIF detection include, but are not limited to, the nonparametric Mantel-

Haenszel Chi-Square (MH) statistic (Holland & Thayer, 1988), log linear modeling

(Mellenbergh, 1982), and the use of logistic regression (Swaminathan & Rogers,

1990). Although these methods may be categorized into IRT-based versus non-IRT-

based or parametric versus nonparametric, it should be noted that the parametric and

nonparametric models are somewhat interrelated given that several parametric models

are related to nonparametric methods. For example, both the Rasch IRT model and

the MH methods examine the odds of item response conditional on a latent trait

estimate or a (weighted) total score (Teresi, 2006).

There are advantages and disadvantages in the use of either parametric or

nonparametric approaches to DIF detection. Parametric, IRT-based methods, though

statistically more powerful in detecting DIF, may identify DIF as an artifact of

model misspecification, and very large sampling covariance among parameter

estimates can also cause some concern in such methods (Potenza & Dorans, 1995). On


the other hand, nonparametric methods are relatively free of model misspecification

and collinearity problems (Bolt, 2002), but they require sufficient data to directly

estimate the regressions of item score on test score. Moreover, IRT-based, parametric

methods can be particularly helpful in detecting nonuniform DIF. Built on the IRT

framework, this dissertation employed Lord’s Wald test for the examination of

potential DIF in the SDQI mathematics items.

Lord’s Wald Test

Lord’s (1977, 1980) Wald test is asymptotically equivalent to the TSW-ΔG2

and compares vectors of IRT item parameters between groups. For a specific item, if

the vectors of its parameters differ significantly between groups, then the OCCs differ

across groups, indicating significant DIF. For a two-group comparison, Lord first

proposed a test to evaluate the significance of DIF for the item threshold (bi) only:

\[
d_i = \frac{\hat{b}_{Fi} - \hat{b}_{Ri}}{\sqrt{\operatorname{Var}(\hat{b}_{Fi}) + \operatorname{Var}(\hat{b}_{Ri})}}
\tag{4}
\]

where $\hat{b}_{Ri}$ and $\hat{b}_{Fi}$ are the maximum likelihood estimates of the parameter bi in the

reference group and the focal group, respectively, and Var($\hat{b}_{Ri}$) and Var($\hat{b}_{Fi}$) are the

corresponding estimates of the sampling variances of $\hat{b}_{Ri}$ and $\hat{b}_{Fi}$. This test was then

extended for differences between the discrimination parameters (ai), becoming a more

general test of the joint difference between [ai, bi] for the two groups:

\[
\chi_i^2 = \nu_i' \, \Sigma_i^{-1} \, \nu_i
\tag{5}
\]

where $\nu_i'$ is $[\hat{a}_{Fi} - \hat{a}_{Ri},\ \hat{b}_{Fi} - \hat{b}_{Ri}]$, $\Sigma_i$ is the estimate of the sampling variance-

covariance matrix of the differences between the item parameters, and $\chi_i^2$ follows a chi-square

distribution with two degrees of freedom (Langer, 2008).
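
To make the arithmetic of Equations 4 and 5 concrete, the following sketch computes both statistics for a single item in the two-parameter form given above. All numerical inputs are made up for illustration, the function names are hypothetical, and the covariance matrix of the differences is simply supplied as an input rather than estimated with the supplemented EM approach that the improved test relies on.

```python
import math


def lord_d(b_focal, b_ref, var_focal, var_ref):
    """Equation 4: standardized difference between the two threshold estimates."""
    return (b_focal - b_ref) / math.sqrt(var_focal + var_ref)


def lord_chi2(diff, cov):
    """Equation 5 for the pair [a, b]: v' * inverse(Sigma) * v with a 2 x 2 Sigma."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    inv = ((s22 / det, -s12 / det), (-s21 / det, s11 / det))
    da, db = diff
    return (da * (inv[0][0] * da + inv[0][1] * db)
            + db * (inv[1][0] * da + inv[1][1] * db))


def p_normal_two_sided(z):
    """Two-sided p value for a standard normal statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def p_chi2_df2(x):
    """Upper-tail p value for chi-square with 2 df, which has the closed form exp(-x / 2)."""
    return math.exp(-x / 2.0)


# Made-up estimates, for illustration only.
d = lord_d(b_focal=0.40, b_ref=0.10, var_focal=0.010, var_ref=0.006)
chi2 = lord_chi2(diff=(0.20, 0.30), cov=((0.020, 0.004), (0.004, 0.016)))
print(round(d, 3), round(p_normal_two_sided(d), 4))
print(round(chi2, 3), round(p_chi2_df2(chi2), 4))
```

For an item with several thresholds, the difference vector and covariance matrix would be longer and the degrees of freedom would rise accordingly, so the closed-form p value used here applies only to the two-parameter case.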

The original implementation of the Wald test was not particularly pertinent to

IRT models, and tended to show severe Type I error inflation. The test was recently improved


(Cai, 2012; Cai, Thissen, & du Toit, 2011; Langer, 2008) with modern, accurate IRT-

based error estimation. The improved Wald test estimates the covariance matrix using

the supplemented expectation maximization (SEM) algorithm (Cai, 2008), a strategy

for calculating the information matrix when an EM algorithm is used for parameter

estimation.

Langer (2008) introduced a two-stage Wald procedure for detecting DIF. In

the first stage, the mean and standard deviation of the reference group are fixed while

those of the focal group are freely estimated, and all of the item parameters are

constrained equal between groups. In the second stage, the focal mean and standard

deviation are fixed to the values obtained in the first stage. The simulation study

conducted by Woods, Cai, and Wang (2012), however, indicated that the DIF

contamination in the two-stage procedure is likely to produce Type I error inflation

and other inaccuracies, because the focal mean and standard deviation are estimated

from a misspecified model if there is DIF that does not cancel out in the first stage.

Woods et al. (2012) thus recommended using the one-stage procedure (Cai et al.,

2011), in which the mean and standard deviation of the focal group are estimated

simultaneously with estimation of the item parameters, and item parameters are either

constrained equal between groups (anchor items) or free to vary between groups

(studied items).

With regard to anchor and studied items, most DIF procedures require user-specified

anchor items. Anchor items can be specified based on

prior research or prior testing. DIF analysis without much prior research may benefit

from purification or anchor selection methods. In fact, Woods et al. (2012)

recommended applying Langer’s (2008) two-stage procedure as the prior testing for

anchor selection before the one-stage procedure.

Analysis of Data

Prior to analyses, the data were first screened for missingness, multicollinearity,

and univariate and multivariate outliers in SPSS v20. Bivariate correlation



coefficients were computed among the eight SDQI items of each wave for determining

multicollinearity (r > .80); univariate outliers were examined using the criterion of |z| >

3; and Cook’s distances were computed to identify multivariate outliers, using the criterion

of Cook’s distance > 1 (Cook & Weisberg, 1982).
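The three screening rules can be sketched as follows. This is an illustration only: the screening in this study was run in SPSS, the function names are hypothetical, the toy data are invented, and the regression used for Cook's distance here (one outcome on one predictor) is an assumption.

```python
import numpy as np

def screen_items(items, r_cut=0.80, z_cut=3.0):
    """Return item pairs with |r| > r_cut and (row, column) cells with |z| > z_cut."""
    items = np.asarray(items, dtype=float)
    r = np.corrcoef(items, rowvar=False)
    k = r.shape[0]
    collinear = [(i, j) for i in range(k) for j in range(i + 1, k)
                 if abs(r[i, j]) > r_cut]
    z = (items - items.mean(axis=0)) / items.std(axis=0, ddof=1)
    outliers = [tuple(idx) for idx in np.argwhere(np.abs(z) > z_cut)]
    return collinear, outliers

def cooks_distance(x, y):
    """Cook's distance for each case in an ordinary least squares fit of y on x."""
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), np.asarray(x, dtype=float)])
    hat = design @ np.linalg.pinv(design.T @ design) @ design.T
    leverage = np.diag(hat)
    resid = y - hat @ y
    p = design.shape[1]                      # number of estimated coefficients
    mse = resid @ resid / (len(y) - p)
    return resid**2 / (p * mse) * leverage / (1.0 - leverage) ** 2

# Toy data: two perfectly correlated items, and one high-leverage case.
toy_items = [[1, 1], [2, 2], [3, 3], [4, 4]]
collinear, outliers = screen_items(toy_items)
distances = cooks_distance([0, 1, 2, 3, 10], [0, 1, 2, 3, -10])
```

In the toy regression, the last case combines high leverage with a discrepant outcome, so its Cook's distance exceeds the cutoff of 1 while the others do not.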

Analyses were conducted in three phases. In Phase 1, analyses were conducted

to address the first research question of this study regarding the psychometric

properties of the SDQI mathematics subscale from both the CTT and factor analytic

perspective. Factor analysis was first performed to examine the dimensionality of the

scale. Based on the established factor structure of the scale, internal consistency

reliabilities were then assessed to provide CTT-based psychometric information on the

scale. Next, the scale’s predictive validity was assessed through regressing

participants’ mathematics achievement scores on their SDQI scores. Finally, group

differences (i.e., gender, ethnicity, and age), as assessed by the current ECLS-K

adapted SDQI summed scores, were examined using factorial analysis of variance

(ANOVA). In Phase 2, IRT analyses were performed separately on third- and fifth-

grade full samples to address the first research question regarding the IRT-based

psychometric properties of the scale. In Phase 3, DIF analyses were performed across

gender (to address the second research question) and ethnic groups (to address the

third research question) for each wave, and then finally across third- and fifth-grade

data for detecting IRT parameter drift (to address the fourth and fifth research

questions).

Phase 1: Factor Structure, Reliability, Validity, and Group Differences

To determine whether the aforementioned separability (discussed in Chapter 3)

between the affective and cognitive dimensions exists in the SDQI mathematics

subscale, factor analysis was conducted separately on the third- (Wave 5) and fifth-

grade (Wave 6) data. Exploratory factor analysis (EFA) is typically used for

determining the appropriate number of common factors (Brown, 2006). Principal axis

factoring analysis (EFA) with promax rotation was conducted using SPSS v20 on the

eight items separately for Wave 5 and Wave 6 data. Promax rotation allowed for



factors to be correlated (Field, 2009), and the assumption was made that, once

multiple factors were extracted (e.g., perceived competence, interest), these factors

would be correlated with each other. To determine the number of factors, both the

eigenvalues (Kaiser, 1960) and the scree plot’s point of inflexion (Field, 2009) were

consulted. A confirmatory factor analysis (CFA) was then conducted in Mplus v6

(Muthén & Muthén, 2010) using weighted least square mean-and-variance adjusted

(WLSMV) χ2 test statistic estimation. CFA is used in later phases of instrument

development after the underlying structure has been established on prior empirical

grounds. The use of the WLSMV estimator was based on Schmitt’s (2011) suggestion

that data from ordered response scales are often not continuous and not normally

distributed. The acceptable model fit for CFA was defined by Hu and Bentler’s (1999)

combinational rules: (a) Comparative Fit Index (CFI) or Tucker-Lewis Index (TLI) >

0.95 and Standardized Root Mean Square Residual (SRMR) < .09, or (b) Root Mean

Square Error of Approximation (RMSEA) < .05 and SRMR < .06. Because Mplus

provides Weighted Root Mean Square Residual (WRMR) instead of SRMR when the

WLSMV estimator is activated, the cutoff value of WRMR < 1.0 (Yu, 2002) was also

consulted.
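The decision rules above can be written as a small check. This is a sketch under stated assumptions: the function names are hypothetical, and the fit values passed in the example are invented rather than taken from the Mplus output.

```python
def meets_hu_bentler(cfi, tli, rmsea, srmr):
    """Hu and Bentler's (1999) combinational rules as stated in the text."""
    rule_a = (cfi > 0.95 or tli > 0.95) and srmr < 0.09
    rule_b = rmsea < 0.05 and srmr < 0.06
    return rule_a or rule_b

def meets_wrmr(wrmr, cutoff=1.0):
    """Yu's (2002) cutoff, consulted because Mplus reports WRMR under WLSMV."""
    return wrmr < cutoff

# Hypothetical fit indices for illustration only.
passes_rule_a = meets_hu_bentler(cfi=0.97, tli=0.96, rmsea=0.14, srmr=0.05)
passes_rule_b = meets_hu_bentler(cfi=0.90, tli=0.90, rmsea=0.04, srmr=0.05)
fails_both = meets_hu_bentler(cfi=0.90, tli=0.90, rmsea=0.08, srmr=0.05)
wrmr_ok = meets_wrmr(0.85)
```

Because either rule is sufficient, a model can satisfy the combined criterion with a high RMSEA as long as CFI or TLI and SRMR are acceptable, which is one reason the WRMR cutoff is consulted as an additional check.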

Reliability of a test refers to the degree to which a test consistently measures

whatever it is measuring (Gay et al., 2008). There are several types of reliability such

as test-retest reliability, equivalent forms reliability, inter-rater reliability, and internal

consistency reliability. The design of the current study only allowed for the

assessment of the internal consistencies of the scale, which were assessed using the

Cronbach’s alpha (α) method in SPSS v20. Criterion-related validity is determined by

relating performance on a test to performance on a second test or other measure (Gay

et al., 2008). A measure of academic achievement in mathematics, already IRT-scaled

in the ECLS-K data set, was used to determine the criterion-related

validity of the SDQI as a measure of mathematics interest. The variable names of

mathematics IRT scale scores were C5R4MSCL for third-grade and C6R4MSCL for

fifth-grade, and a bivariate regression analysis was conducted separately on the third- and

fifth-grade data to predict mathematics achievement from children’s SDQI summed

scores on the mathematics subscale. Finally, to detect the group differences across

gender, ethnicity and age groups as reflected by the SDQI summed scores, a 2 × 4

(Gender × Ethnicity) factorial ANOVA was conducted separately on the third- and fifth-grade

data, and a paired-samples t test was conducted to examine how children’s SDQI

scores changed from third grade to fifth grade.
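For the internal consistency step described above, Cronbach's alpha can be computed directly from an item-response matrix. The sketch below is illustrative: the reliabilities in this study were obtained in SPSS, the function name is hypothetical, and the toy responses are invented.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items array of complete cases."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    sum_item_var = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)        # variance of summed scores
    return k / (k - 1) * (1.0 - sum_item_var / total_var)

# Invented responses on a 4-point scale (rows = respondents, columns = items).
toy = [[1, 2],
       [2, 1],
       [3, 4],
       [4, 3]]
alpha_toy = cronbach_alpha(toy)

# Items that are exact copies of each other give the maximum value of 1.
alpha_identical = cronbach_alpha([[1, 1, 1], [2, 2, 2], [4, 4, 4]])
```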

Phase 2: Item Response Theory (IRT) Analysis

The IRT analyses were conducted separately using the third- and the fifth-

grade data in flexMIRT v2. flexMIRT v2 bases its “graded model calibration”

function on Samejima’s (1969) 2-PL GRM and provides, for each estimation,

parameter estimates (i.e., item discrimination as ai, item threshold as bi or item

intercept as ci), item and test-information function values, and fit indices including

Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), the

Pearson χ2 statistic, the likelihood ratio statistic (labeled G2), the estimated population

discrepancy function value (labeled F0hat), and the RMSEA (Houts & Cai, 2013).
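As a minimal sketch of the graded response model in its slope-threshold form, category probabilities can be obtained as differences between adjacent cumulative logistic curves. The function name and the parameter values are hypothetical, and the code does not reproduce flexMIRT's estimation routine.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities under the graded response model.

    b holds the ordered thresholds (b1 < b2 < ...); the result has
    len(b) + 1 entries, one per response category.
    """
    b = np.asarray(b, dtype=float)
    # Cumulative probability of responding in category k or higher.
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    upper = np.concatenate(([1.0], p_star))
    lower = np.concatenate((p_star, [0.0]))
    return upper - lower

# Hypothetical four-category item evaluated at theta = 0.
probs = grm_category_probs(theta=0.0, a=2.0, b=[-1.5, -0.5, 0.5])
```

With three thresholds the item has four categories, matching the four response options of the SDQI items.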

Phase 3: Differential Item Functioning (DIF) Analysis

A series of DIF analyses were performed using the IRT-based Lord’s Wald test

in flexMIRT v2. DIF analysis is normally conducted by comparing two groups of

respondents: the reference group (usually the majority) and the focal group (usually

the minority). First, DIF across gender was examined separately in third- and fifth-

grade data with boys as the reference group and girls as the focal group. Next, to

detect DIF across ethnicity, the full samples of third- and fifth-grade data were each

divided into four groups with White being the reference group, and African American,

Hispanic, and Asian being the focal group for each of the reference-focal comparisons.

Finally, the fifth-grade full sample was specified as the focal group to be contrasted

with the third-grade full sample as the reference group for DIF. This phenomenon in

which item parameters may change over repeated administrations is also referred to as

item parameter drift (IPD; Goldstein, 1983). For the IPD detection where the data



were longitudinal (i.e., repeated measures) in nature, a special consideration was to

account for the fact that the observations were not independent. Thus far, limited

literature is available for addressing this issue, particularly in the use of the Wald χ2

test for detecting IPD. In the current study, the decision was made to use the sandwich

estimator that corrects the under-estimation of variance (Huber, 1967). A remarkable

advantage of the sandwich estimator is that it provides valid standard errors when the

assumed covariance model of repeated measures is incorrect. Further, the use of the

sandwich estimator is considered to be best suited to balanced longitudinal data with

relatively large sample size and relatively small number of repeated measures (Lipsitz

& Fitzmaurice, 2009), which are exactly the properties of the ECLS-K data utilized in

this study.

The Wald test was employed for determining significant DIF in polytomous

items with the alpha (α) value of .05. For the Wald test, flexMIRT prints each of the

Wald χ2 values with the associated degree of freedom and p-value. Because each DIF

analysis involved multiple items, such multiple testing would normally lead to

concerns of inflated Type I error. Interestingly, in examining the effects of multiple

testing adjustments in various DIF detection methods, Kim and Oshima (2013)

reported that the Type I error rate of the Wald test is well controlled even before

adjustment. Therefore, multiple testing adjustment was not applied in the current

study. Following Woods’s (2009) recommendations, the DIF analysis of the current

study combined Langer’s (2008) two-stage procedure and Cai et al.’s (2011) one-stage

procedure to optimize the control of Type I error as well as the accuracy of multi-

group estimation. In particular, each DIF analysis comprised three stages. In the

first stage, a model was fitted wherein the reference group mean and standard

deviation were fixed to 0 and 1, respectively, and the focal mean and standard

deviation were allowed to be freely estimated, with all item parameters constrained

equal between groups. In the second stage, a model was fitted with the focal mean

and standard deviation fixed to the values obtained in the first stage, and items with

non-significant Wald χ2 values were identified as anchors. In the third stage, the focal



mean and standard deviation were again freely estimated, with item parameters of the

anchor items being constrained equal between groups, while those of the studied items

were allowed to vary between groups.
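The anchor-selection rule in the second stage can be sketched as a simple filter on the Wald statistics. The sketch assumes 4 degrees of freedom per item (three thresholds plus one slope) and uses the closed-form upper-tail probability that holds for even degrees of freedom. The input values are the fifth-grade Male vs. Female statistics from Table 4.6, reused here only as illustrative input; whether they correspond exactly to the second-stage output is an assumption, and the function names are hypothetical.

```python
import math

def chi2_sf_df4(x):
    """Upper-tail probability of a chi-square variate with 4 degrees of freedom."""
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

def select_anchors(wald, alpha=0.05):
    """Items whose Wald statistic is non-significant at the given alpha."""
    return [item for item, x in wald.items() if chi2_sf_df4(x) >= alpha]

# Illustrative input (fifth-grade Male vs. Female column of Table 4.6).
wald_values = {
    "SDQ6": 98.3, "SDQ12": 77.5, "SDQ16": 6.1, "SDQ22": 6.9,
    "SDQ26": 115.7, "SDQ30": 0.2, "SDQ36": 9.8, "SDQ41": 36.4,
}
anchors = select_anchors(wald_values)
studied = [item for item in wald_values if item not in anchors]
```

Items returned as anchors would be constrained equal across groups in the third stage, and the remaining items would be treated as studied items.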

To determine the impact of the DIF at the scale level, the test characteristic

curve (TCC) of each group was visually examined. The test characteristic curve is the

functional relation between the true score and the latent trait scale. Given any latent

trait level, the corresponding true score can be found via the test characteristic curve

(Baker, 1985, 2001). To define the test characteristic curve, an integer scoring

function (ISF) is defined such that, for K = 4 response categories (k = 0, 1, 2, and 3):

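\[ E_i(\theta) = \sum_{k=1}^{3} k P_{k,i} = \frac{e^{a_i(\theta - b_{1,i})}}{1 + e^{a_i(\theta - b_{1,i})}} + \frac{e^{a_i(\theta - b_{2,i})}}{1 + e^{a_i(\theta - b_{2,i})}} + \frac{e^{a_i(\theta - b_{3,i})}}{1 + e^{a_i(\theta - b_{3,i})}} \qquad (6) \]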

where Ei(θ) is the expected score for a given polytomous item i with latent trait level θ.

Then, the test characteristic function is simply computed by summing up the ISF of

each item:

\[ E(\theta) = \sum_{i=1}^{N} E_i(\theta) \qquad (7) \]

where E(θ) is the scale-level expected score with latent trait level θ, and N is the total

number of items in a scale. When the Wald test is used for DIF detection, it is particularly

necessary to examine the test characteristic curves because DIF of opposite directions

may cancel out, in which case the scale score may still be unbiased (Edelen, Thissen,

Teresi, Kleinman, & Ocepek-Welikson, 2006).
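Equations 6 and 7 translate directly into a short computation. The sketch below assumes categories scored 0 through 3, as in Equation 6; the function names and item parameters are hypothetical.

```python
import numpy as np

def expected_item_score(theta, a, b):
    """Equation 6: sum of the cumulative logistic curves of one item."""
    b = np.asarray(b, dtype=float)
    return float(np.sum(1.0 / (1.0 + np.exp(-a * (theta - b)))))

def tcc_value(theta, items):
    """Equation 7: scale-level expected score, summed over (a, b) pairs."""
    return sum(expected_item_score(theta, a, b) for a, b in items)

# Two hypothetical items with thresholds placed symmetrically around zero.
items = [(2.0, [-1.0, 0.0, 1.0]),
         (3.0, [-1.0, 0.0, 1.0])]

score_at_mean = tcc_value(0.0, items)        # expected scale score at theta = 0
grid = np.linspace(-3.0, 3.0, 13)
curve = [tcc_value(t, items) for t in grid]  # points along the curve
```

Evaluating the curve separately for the reference and focal groups, each with its own parameter estimates, gives the two curves that are compared visually. If responses are coded 1 to 4 rather than 0 to 3, each item's expected score is shifted up by 1.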



CHAPTER IV

RESULTS

Prior to analyses, the data were first screened for missingness, multicollinearity

and univariate and multivariate outliers in SPSS v20. Table 4.1 summarizes the

ECLS-K variables utilized in the current study along with associated percentages of

missing data and descriptive statistics. In the full sample of 14,631 for the current

study, total missingness was limited to 11.03% of all cells.

Table 4.1

Descriptive Statistics of the ECLS-K Variables Included in the Study (N = 14,631)

Variable Name   Description                     Missing%   Min     Max      M        SD
RACE            Child Composite Race            0.13%
GENDER          Child Composite Gender          0.00%
MCC5SDQ6        Wave 5 SDQ6                     1.57%      1.00    4.00     3.06     0.99
MC5SDQ12        Wave 5 SDQ12                    1.57%      1.00    4.00     2.81     1.09
MC5SDQ16        Wave 5 SDQ16                    1.61%      1.00    4.00     3.22     0.89
MC5SDQ22        Wave 5 SDQ22                    1.58%      1.00    4.00     3.19     0.98
MC5SDQ26        Wave 5 SDQ26                    1.58%      1.00    4.00     2.91     1.04
MC5SDQ30        Wave 5 SDQ30                    1.57%      1.00    4.00     3.32     0.99
MC5SDQ36        Wave 5 SDQ36                    1.59%      1.00    4.00     3.25     0.97
MC5SDQ41        Wave 5 SDQ41                    1.59%      1.00    4.00     3.27     0.89
MC6SDQ6         Wave 6 SDQ6                     22.88%     1.00    4.00     2.90     0.96
MC6SDQ12        Wave 6 SDQ12                    22.88%     1.00    4.00     2.47     1.02
MC6SDQ16        Wave 6 SDQ16                    22.88%     1.00    4.00     3.07     0.88
MC6SDQ22        Wave 6 SDQ22                    22.89%     1.00    4.00     3.00     0.97
MC6SDQ26        Wave 6 SDQ26                    22.89%     1.00    4.00     2.76     0.97
MC6SDQ30        Wave 6 SDQ30                    22.88%     1.00    4.00     3.07     1.01
MC6SDQ36        Wave 6 SDQ36                    22.88%     1.00    4.00     2.99     0.98
MC6SDQ41        Wave 6 SDQ41                    22.88%     1.00    4.00     3.08     0.89
C5R4MSCL        Wave 5 Math IRT Scale Score     1.78%      34.56   166.25   98.73    24.71
C6R4MSCL        Wave 6 Math IRT Scale Score     22.94%     50.86   170.66   123.69   24.79



Next, bivariate correlations were computed for each wave, and the results are

summarized in Tables 4.2 and 4.3. As shown in the tables, multicollinearity was

minimal among the eight items: only the correlation between SDQ30 and SDQ36 in the

fifth-grade data (r = .81) slightly exceeded the .80 cut-off. Finally, using the |z| > 3 criterion

for univariate outliers and Cook’s distance > 1 criterion for multivariate outliers, no

cases were identified as influential cases.

Table 4.2

Bivariate Correlations among the SDQI Items in Third Grade

Item        1      2      3      4      5      6      7      8
1. SDQ6     --
2. SDQ12    .50*   --
3. SDQ16    .54*   .40*   --
4. SDQ22    .52*   .63*   .49*   --
5. SDQ26    .37*   .29*   .35*   .32*   --
6. SDQ30    .54*   .62*   .47*   .73*   .32*   --
7. SDQ36    .54*   .63*   .50*   .72*   .33*   .77*   --
8. SDQ41    .62*   .50*   .62*   .58*   .40*   .62*   .61*   --

*p < .01.

Table 4.3

Bivariate Correlations among the SDQI Items in Fifth Grade

Item        1      2      3      4      5      6      7      8
1. SDQ6     --
2. SDQ12    .52*   --
3. SDQ16    .63*   .44*   --
4. SDQ22    .60*   .68*   .54*   --
5. SDQ26    .53*   .41*   .50*   .46*   --
6. SDQ30    .60*   .66*   .53*   .79*   .46*   --
7. SDQ36    .59*   .69*   .52*   .79*   .47*   .81*   --
8. SDQ41    .70*   .54*   .69*   .65*   .58*   .67*   .67*   --

*p < .01.



Phase 1: Factor Structure, Reliability, Validity, and Group Differences

Factor Structure

To determine whether the SDQI mathematics subscale satisfied the

unidimensionality assumption of IRT, factor analysis was performed separately on the

third-grade (Wave 5) and fifth-grade (Wave 6) data. EFA on third- and fifth-grade

data yielded similar results: consistent with what was indicated by the point of

inflexion in the scree plot, the analysis extracted a single factor with an eigenvalue

greater than 1. All eight items appeared to have salient factor loadings (λ ≥.40;

Gorsuch, 1997). The Kaiser-Meyer-Olkin (KMO) measure verified the sampling

adequacy for the analysis, KMO = .92 for third-grade and .93 for fifth-grade, which

are “great” according to Hutcheson and Sofroniou’s (1999) criteria. The one-factor

solutions explained 53.49% (Wave 5) and 60.46% (Wave 6) of the variance.
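The eigenvalue-greater-than-1 criterion can be illustrated with the rounded third-grade correlations from Table 4.2. This is a rough sketch rather than a reproduction of the SPSS analysis: it uses the unreduced correlation matrix built from two-decimal values, so its eigenvalues and the first eigenvalue's share of variance will not match the principal axis factoring output exactly.

```python
import numpy as np

# Lower triangle of Table 4.2, item order SDQ6, 12, 16, 22, 26, 30, 36, 41.
lower = [
    [],
    [.50],
    [.54, .40],
    [.52, .63, .49],
    [.37, .29, .35, .32],
    [.54, .62, .47, .73, .32],
    [.54, .63, .50, .72, .33, .77],
    [.62, .50, .62, .58, .40, .62, .61],
]

R = np.eye(8)
for i, row in enumerate(lower):
    for j, r in enumerate(row):
        R[i, j] = R[j, i] = r

eigenvalues = np.linalg.eigvalsh(R)[::-1]        # sorted in descending order
n_retained = int(np.sum(eigenvalues > 1.0))      # Kaiser (1960) criterion
first_share = eigenvalues[0] / eigenvalues.sum() # share of total variance
```

Because the eigenvalues of a correlation matrix sum to the number of items, an eigenvalue above 1 indicates a factor that accounts for more variance than a single item.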

Turning to CFA, however, the one-factor CFA model did not fit the third-grade

data sufficiently well, χ2(20) = 5958.74, p < .0001, CFI = 0.97, TLI = 0.96, RMSEA

= .14, WRMR = 7.17. As such, modification indices (MIs; Sörbom, 1989) were

consulted, and the source of the poor fit was identified as large correlated errors

(Brown, 2006). For example, the correlated errors between SDQ16 and SDQ41

produced the largest Δχ2= 1738.27. A similar situation occurred when the one-factor

model was fit to the Wave 6 data, χ2(20) = 7826.73, p < .0001, CFI = 0.97, TLI = 0.96,

RMSEA = .19, WRMR = 9.01, also with considerable correlated errors among items.

Although the factor structure of the scale was not firmly supported by the CFA,

a decision was made to retain the single-factor structure of the scale based on several

considerations. First, the seven-component structure of the original SDQI has been

cross-validated on different samples, and the mathematics items consistently loaded

on one factor (Marsh, Relich, et al., 1983; Marsh, Smith, et al., 1983). It is highly

unlikely that the ECLS-K adaptations would invalidate the established factor structure

of the SDQI. Second, both the eigenvalue > 1 criterion and the scree plot in the EFA

framework were clearly in favor of the one-factor solution over any multiple-factor

solutions. Finally, though the RMSEA and WRMR went beyond the cut-off values,



the CFI and TLI were well above the cut-off of 0.95 in both CFA models. Given that

Hu and Bentler’s (1999) combinational rules are considered too stringent and more

appropriate when evaluating statistical significance rather than goodness of model fit

(Marsh, Hau, & Wen, 2004), it is reasonable to argue that no explicit separability is

evident in the scale. Table 4.4 presents the factor loadings of items as assessed by the

one-factor CFA.

Table 4.4

Internal Consistency Reliabilities and Factor Loadings by Item

          Third Grade                                   Fifth Grade
Item #    Item-Total    α if Item    Factor             Item-Total    α if Item    Factor
          Correlation   Deleted      Loading            Correlation   Deleted      Loading
SDQ6      .67           .88          .77                .73           .91          .81
SDQ12     .67           .88          .79                .69           .92          .81
SDQ16     .62           .89          .74                .67           .92          .78
SDQ22     .75           .87          .88                .81           .91          .91
SDQ26     .42           .91          .51                .59           .92          .66
SDQ30     .77           .87          .93                .81           .91          .93
SDQ36     .77           .87          .91                .81           .91          .92
SDQ41     .74           .87          .84                .80           .91          .88
α         .89                                           .92

Internal Consistency Reliability

Based on the one-factor structure of the eight items, the internal consistency

reliabilities were assessed using Cronbach’s alpha (α) method, and the scale overall

showed good to excellent internal consistencies (see Table 4.4). Among all items,

SDQ22 (“I am interested in math”), SDQ30 (“I like math”), SDQ36 (“I enjoy doing

work in math”), and SDQ41 (“I am good at math”) showed higher item-total

correlations across grade levels. SDQ26 (“I can do very difficult problems in math”)

stood out as it had notably lower item-total correlations in both waves.



Criterion-Related Validity

To determine the criterion-related validity of the scale, children’s mathematics

achievement scores (i.e., Variables C5R4MSCL and C6R4MSCL as shown in Table

4.1) were used as the outcome variables in the analyses. A linear regression analysis

was conducted separately on the third- and fifth-grade data to evaluate the prediction of

children’s mathematics performance from the summed scores of the SDQI items. The

relationships were significant for both waves: R2 = .02, adjusted R2 = .02, p < .001 for

third grade and R2 = .06, adjusted R2 = .06, p < .001 for fifth grade. In either wave, the

SDQI summed score appeared to be a significant, positive predictor of children’s

performance, β = .14, p < .001 for third grade and β = .25, p < .001 for fifth grade.

These findings are consistent with the literature showing that academic interest has a positive

relationship with performance (e.g., Hidi, 2000; Hidi & Harackiewicz, 2000), so it

may be concluded that the SDQI exhibited sufficient criterion-related validity in

predicting children’s mathematics performance.
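As a consistency check on the values reported above, recall that in a regression with a single predictor the standardized coefficient equals the Pearson correlation, so R2 is the square of β:

\[ R^2 = \beta^2: \qquad (.14)^2 = .0196 \approx .02 \ \text{(third grade)}, \qquad (.25)^2 = .0625 \approx .06 \ \text{(fifth grade)} \]

In other words, the SDQI summed score accounts for roughly 2% and 6% of the variance in mathematics achievement in the two waves.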

Group Differences

To detect the group differences across gender and ethnicity groups as reflected

by the SDQI summed scores, a 2 × 4 (Gender × Ethnicity) factorial ANOVA was

conducted separately on the third- and fifth-grade data. The ANOVA on the third-grade data

indicated a significant main effect for gender, F(1, 13573) = 128.12, p < .001, partial

η2 = .01, a significant main effect for ethnicity, F(3, 13573) = 26.27, p < .001, partial

η2 = .01, and a significant interaction between ethnicity and gender F(3, 13573) =

3.51, p = .02, partial η2 = .001. As assessed by the SDQI scale score, boys

demonstrated significantly higher mathematics interest (M = 25.75, SD = 5.77) than

girls (M = 24.27, SD = 6.09) across ethnicity in the third grade. In terms of ethnic

differences as assessed using the Hochberg’s GT2 post hoc comparison, both African

American (M = 25.77, SD = 5.99) and Hispanic children (M = 25.51, SD = 5.65)

demonstrated significantly higher levels of interest as compared with White children

(M = 24.67, SD = 6.08), both at p < .001, but no significant difference was found

between White and Asian children (M = 25.16, SD = 5.97; p = .09).
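The partial η2 values above can be recovered from the reported F statistics and degrees of freedom, which gives a quick check on the size of these effects. The function name is hypothetical; the inputs are the third-grade values reported in this section.

```python
def partial_eta_squared(f, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return f * df_effect / (f * df_effect + df_error)

# Third-grade ANOVA results as reported in the text.
eta_gender = partial_eta_squared(128.12, 1, 13573)
eta_ethnicity = partial_eta_squared(26.27, 3, 13573)
eta_interaction = partial_eta_squared(3.51, 3, 13573)
```

All three values round to the figures reported above, which shows that the significant effects are small in magnitude, as would be expected with a sample of this size.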



The ANOVA on the fifth-grade data yielded similar results regarding gender

differences but very different results regarding ethnic differences. The ANOVA

indicated a significant main effect for gender, F(1, 10637) = 83.89, p < .001, partial

η2 = .01 and a significant main effect for ethnicity, F(3, 10637) = 7.30, p < .001,

partial η2 = .002, but a non-significant interaction between ethnicity and gender F(3,

10637) = 2.53, p = .06, partial η2 = .001. As in third grade, fifth-grade boys

demonstrated significantly higher mathematics interest (M = 23.94, SD = 6.08) than

girls (M = 22.75, SD = 6.23) across ethnicity. In terms of ethnic differences as

assessed using the Hochberg’s GT2 post hoc comparison, Asian American children (M

= 24.10, SD = 5.67) scored significantly higher than White children (M = 23.21, SD =

6.12; p = .001), as did African American children (M = 23.73, SD = 6.65; p = .03).

However, no significant difference was found between White and Hispanic children

(M = 23.29, SD = 6.23; p = .997). Figure 4.1 illustrates the group differences across

the two waves.

Finally, results of the paired-samples t test indicate that children scored

significantly lower on the SDQ mathematics scale in fifth grade (M = 23.35, SD = 6.18)

than third grade (M = 24.98, SD = 5.94), t(11055) = 26.62, p < .001.

Figure 4.1. Group differences as assessed by the SDQI mathematics scale score.



Phase 2: Item Response Theory (IRT) Analysis

The IRT analyses were conducted separately on the third- and fifth-grade full

samples (N = 14,631) in flexMIRT v2. Table 4.5 presents the IRT parameter

estimates of the items by grade level, and Figures 4.2 and 4.3 illustrate the OCCs of

each item for third and fifth graders, respectively.

Table 4.5

IRT Parameter Estimates (a, b1, b2, b3) of Third- and Fifth-Grade Full Samples

Item      ai      s.e.    b1,i     s.e.    b2,i     s.e.    b3,i     s.e.
Third Grade^a
SDQ6      2.07    0.03    -1.80    0.03    -0.69    0.02    0.16     0.01
SDQ12     2.33    0.03    -1.27    0.02    -0.32    0.01    0.41     0.01
SDQ16     1.77    0.03    -2.36    0.04    -1.08    0.02    0.04     0.02
SDQ22     3.51    0.05    -1.59    0.02    -0.76    0.01    -0.07    0.01
SDQ26     1.04    0.02    -2.25    0.05    -0.75    0.02    0.57     0.02
SDQ30     4.81    0.08    -1.44    0.02    -0.86    0.01    -0.32    0.01
SDQ36     4.23    0.06    -1.55    0.02    -0.81    0.01    -0.16    0.01
SDQ41     2.71    0.04    -1.96    0.03    -0.99    0.02    -0.08    0.01
Fifth Grade^b
SDQ6      2.28    0.04    -1.69    0.03    -0.53    0.02    0.53     0.02
SDQ12     2.48    0.04    -1.08    0.02    0.11     0.01    0.98     0.02
SDQ16     1.89    0.03    -2.25    0.04    -0.87    0.02    0.41     0.02
SDQ22     4.32    0.07    -1.50    0.02    -0.53    0.01    0.28     0.01
SDQ26     1.51    0.03    -1.87    0.03    -0.39    0.02    0.93     0.02
SDQ30     5.02    0.08    -1.35    0.02    -0.61    0.01    0.09     0.01
SDQ36     4.68    0.07    -1.43    0.02    -0.53    0.01    0.26     0.01
SDQ41     3.04    0.05    -1.87    0.03    -0.74    0.02    0.32     0.01

a -2loglikelihood = 219576.39, AIC = 219640.39, BIC = 219883.30, G2(4346) =

27264.39, p < .0001, F0hat = 1.86, RMSEA = .02.

b -2loglikelihood = 172007.52, AIC = 172071.52, BIC = 172314.43, G2(3143) =

26971.03, p < .0001, F0hat = 1.84, RMSEA = .02.



Figure 4.2. Option characteristic curves of the items in third grade.



Figure 4.3. Option characteristic curves of the items in fifth grade.



For third-grade data, all eight items demonstrated “very high” discrimination

according to Baker’s criteria (1985, 2001; a > 1.7) except SDQ26 (“I can do very

difficult problems in math”) which had “moderate” discrimination. The average item

thresholds were -1.78 for b1 (SD = 0.36), -0.78 for b2 (SD = 0.21), and 0.07 for b3 (SD

= 0.28). For fifth-grade data, all items again demonstrated “very high” discrimination

except SDQ26 which showed “good” discrimination. The average item thresholds

were -1.63 for b1 (SD = 0.34), -0.51 for b2 (SD = 0.27), and 0.48 for b3 (SD = 0.30).

Thus, it appears that the scale had fairly low item thresholds in assessing third graders’

mathematics interest. In particular, these items seemed incapable of obtaining much

information about children whose interest levels were above the mean (i.e., θ > 0). From

the IRT perspective, these items are of better psychometric qualities in assessing fifth

rather than third-graders’ interest levels. Such comparisons are illustrated in Figure

4.4 where the item information function curves for each item across grade levels are

combined in one graph. In general, an item covers more area under the curve (AUC)

when administered to fifth- than third-graders, particularly to the right side of the

latent trait (θ) continuum. As a result, the scale in total also provides more

information for those who go beyond the mean in their interest levels, as illustrated by

the test information function curves in Figure 4.5. Nonetheless, the scale appears

unbalanced in terms of thresholds even for fifth graders, because it provides sufficient

information from θ = -2 through 0, but little information for θ beyond 1.
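The comparison of information across grade levels can be illustrated with the item information function of the graded response model, written here in its standard form as the sum over categories of the squared derivative of each category probability divided by that probability. The function name is hypothetical; the parameters are the SDQ30 estimates from Table 4.5, and the sketch does not reproduce flexMIRT's plotted curves.

```python
import numpy as np

def grm_item_information(theta, a, b):
    """Item information for a graded response item at a given theta."""
    b = np.asarray(b, dtype=float)
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    cum = np.concatenate(([1.0], p_star, [0.0]))   # boundary curves fixed at 1 and 0
    probs = cum[:-1] - cum[1:]                     # category probabilities
    slope = a * cum * (1.0 - cum)                  # derivative of each cumulative curve
    d_probs = slope[:-1] - slope[1:]               # derivative of each category curve
    return float(np.sum(d_probs**2 / probs))

# SDQ30 estimates from Table 4.5, evaluated one unit above the mean.
info_third = grm_item_information(1.0, 4.81, [-1.44, -0.86, -0.32])
info_fifth = grm_item_information(1.0, 5.02, [-1.35, -0.61, 0.09])
```

At θ = 1 the fifth-grade parameters yield more information than the third-grade parameters, in line with the pattern described above, because the highest threshold sits further to the right in fifth grade. Summing this function over items gives the test information function.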

In terms of item-level characteristics, three out of the eight items appear to

provide particularly more information across waves: SDQ30 (“I like math”), SDQ36

(“I enjoy doing work in math”), and SDQ22 (“I am interested in math”), while three

items of the cognitive domain (i.e., SDQ26 “I can do very difficult problems in math,”

SDQ16 “I get good grades in math,” and SDQ6 “work in math is easy for me”)

provide much less information when examined using the same metrics (see Figure 4.4).

Comparatively, SDQ12 (“I cannot wait to do math each day”) of the affective domain

and SDQ41 (“I am good at math”) of the cognitive domain contain moderate levels of

information.



Figure 4.4. Item information function (IIF) curves by item across grade levels.



Figure 4.5. Test information function curves by scale across grade levels.

Phase 3: Differential Item Functioning (DIF) Analysis

A series of DIF analyses were performed using the IRT-based Lord’s Wald test

in flexMIRT v2, and the results of these analyses are summarized in Table 4.6.

Because multiple testing adjustment was considered to be unnecessary for the Wald test,

the significance of DIF is determined by the Wald χ2 with 4 degrees of freedom. For

DIF across gender, all eight items demonstrated significant DIF in third grade while

five of them had DIF in fifth grade. For DIF across ethnicity with White as the

reference group, all items demonstrated significant DIF where the focal group was

African American or Hispanic in both grade levels, but the DIF was relatively

moderate when Asian was the focal group. In addition, all eight items demonstrated

significant item parameter drift (IPD) over time.



Table 4.6

Summary of the DIF Results by Gender, Ethnicity, and Age

Lord's Wald Test χ2 (df = 4)

Item      Male vs.     White vs. African    White vs.    White vs.    Item Parameter Drift
          Female       American             Hispanic     Asian        (Third vs. Fifth)
Third Grade
SDQ6      86.1***      87.4***              56.5***      11.2*        189.7***
SDQ12     101.4***     151.9***             115.6***     10.2*        255.5***
SDQ16     15.4**       88.6***              87.8***      43.2***      102.5***
SDQ22     25.7***      322.1***             75.4***      2.1          191.5***
SDQ26     100.0***     196.0***             28.4***      26.4***      449.3***
SDQ30     20.0***      97.2***              43.8***      1.8          138.6***
SDQ36     53.8***      186.4***             76.9***      9.5*         139.5***
SDQ41     47.7***      128.4***             46.3***      19.9***      133.8***
Fifth Grade
SDQ6      98.3***      73.3***              25.4***      3.3
SDQ12     77.5***      60.9***              53.0***      14.9**
SDQ16     6.1          79.3***              124.4***     30.5***
SDQ22     6.9          91.4***              43.0***      1.2
SDQ26     115.7***     123.2***             36.7***      10.7*
SDQ30     0.2          115.0***             13.5**       0.8
SDQ36     9.8*         139.6***             28.3***      7.4
SDQ41     36.4***      60.2***              44.1***      16.0**

*p < .05. **p < .01. ***p < .001.

DIF across Gender

To detect DIF across gender, the full samples of third- and fifth-grade data were

each divided into two groups according to gender. For third graders, no item qualified

as the anchor, so all items were constrained equal between groups. Table 4.7

summarizes the IRT estimates of each group, and Figure 4.6 illustrates the DIF across

gender by item. All eight items showed significant DIF that varied in direction: SDQ6,

16, 26, and 41 were in favor of the reference group while SDQ12, 22, and 36 were in



favor of the focal group. In other words, boys were more likely than girls to endorse

higher-order options of items such as SDQ6 (“Work in math is easy for me”) even

given the same level of the latent trait, while girls were more likely to endorse higher-

order options of items such as SDQ12 (“I cannot wait to do math each day”). The DIF

of SDQ30 appears to be minimal between groups because the DIF in b1 and b2 tended

to cancel out. Interestingly, these DIF results are domain-specific: cognitive (i.e.,

perceived competence) items were consistently in favor of boys, while affective (i.e.,

interest, liking, and positive emotion) items were consistently in favor of girls. On

average, these items showed higher discrimination for girls (Ma = 3.00, SDa = 1.33)

than boys (Ma = 2.75, SDa = 1.14). As shown in Figure 4.6, the DIF of some items

(e.g., SDQ12) was nonuniform because it appeared to be a function of group differences in

item discrimination (a) as well as item thresholds (b).



Table 4.7

Parameter Estimates of Multigroup (Male vs. Female) IRT in Third Grade

Item      ai      s.e.    b1,i     s.e.    b2,i     s.e.    b3,i     s.e.
Male
SDQ6      2.01    0.05    -1.99    0.04    -0.95    0.02    -0.06    0.02
SDQ12     2.22    0.05    -1.31    0.03    -0.37    0.02    0.36     0.02
SDQ16     1.82    0.04    -2.39    0.06    -1.22    0.03    -0.13    0.02
SDQ22     3.47    0.07    -1.64    0.03    -0.86    0.02    -0.17    0.02
SDQ26     1.05    0.03    -2.43    0.07    -1.04    0.04    0.25     0.03
SDQ30     4.64    0.11    -1.52    0.03    -0.98    0.02    -0.44    0.02
SDQ36     4.07    0.09    -1.58    0.03    -0.91    0.02    -0.27    0.02
SDQ41     2.69    0.06    -2.07    0.04    -1.16    0.02    -0.28    0.02
Female
SDQ6      2.19    0.05    -1.82    0.03    -0.69    0.02    0.12     0.02
SDQ12     2.64    0.05    -1.42    0.02    -0.51    0.02    0.18     0.02
SDQ16     1.77    0.04    -2.49    0.05    -1.16    0.03    -0.04    0.02
SDQ22     3.74    0.07    -1.72    0.03    -0.88    0.02    -0.21    0.02
SDQ26     1.04    0.03    -2.30    0.06    -0.71    0.03    0.64     0.04
SDQ30     5.17    0.12    -1.54    0.02    -0.96    0.02    -0.44    0.01
SDQ36     4.64    0.10    -1.68    0.02    -0.93    0.02    -0.31    0.01
SDQ41     2.82    0.06    -2.03    0.03    -1.05    0.02    -0.12    0.02

Note. -2loglikelihood = 218656.60, AIC = 218784.60, BIC = 219270.42, G2(5339) =

32825.73, p < .0001, F0hat = 2.24, RMSEA = .02.



Figure 4.6. Option characteristic curves of multigroup (male vs. female) IRT in third

grade.



For fifth grade, SDQ16, 22, and 30 were specified as the anchor items, and

other items were freely estimated in the multigroup IRT. Table 4.8 summarizes the

IRT estimates of both groups, and Figure 4.7 illustrates the DIF across gender by item.

Five of the eight items showed significant DIF that varied in direction: SDQ6, 26, and

41 were in favor of boys, while SDQ12 and 36 were in favor of girls. As in third

grade, the DIF still appeared to be domain-specific in that cognitive items were

consistently in favor of boys, while affective items were consistently in favor of girls.

On the other hand, SDQ16 (“I get good grades in math”), 22 (“I am interested in

math”), and 30 (“I like math”) became invariant across gender in fifth grade.

Table 4.8

Parameter Estimates of Multigroup (Male vs. Female) IRT in Fifth Grade

Item      ai      s.e.    b1,i     s.e.    b2,i     s.e.    b3,i     s.e.
Male
SDQ6      2.32    0.05    -1.82    0.04    -0.74    0.02    0.32     0.02
SDQ12     2.37    0.06    -1.08    0.03    0.11     0.02    1.01     0.03
SDQ16     1.92    0.05    -2.34    0.06    -0.97    0.03    0.29     0.02
SDQ22     4.32    0.10    -1.54    0.03    -0.61    0.02    0.21     0.02
SDQ26     1.53    0.04    -2.07    0.05    -0.64    0.03    0.67     0.03
SDQ30     5.10    0.12    -1.41    0.03    -0.68    0.02    0.01     0.02
SDQ36     4.63    0.10    -1.46    0.03    -0.60    0.02    0.19     0.02
SDQ41     3.03    0.08    -1.95    0.04    -0.88    0.02    0.16     0.02
Female
SDQ6      2.32    0.05    -1.69    0.04    -0.47    0.02    0.58     0.02
SDQ12     2.80    0.06    -1.19    0.03    -0.05    0.02    0.76     0.02
SDQ16     1.90    0.05    -2.26    0.05    -0.91    0.03    0.36     0.03
SDQ22     4.48    0.10    -1.57    0.03    -0.60    0.02    0.18     0.02
SDQ26     1.53    0.04    -1.81    0.04    -0.30    0.03    1.03     0.04
SDQ30     5.09    0.12    -1.41    0.02    -0.69    0.02    0.01     0.02
SDQ36     4.93    0.11    -1.51    0.03    -0.61    0.02    0.16     0.02
SDQ41     3.14    0.07    -1.89    0.04    -0.75    0.02    0.32     0.02


32165.94, p < .0001, F0hat = 2.20, RMSEA = .02.



Figure 4.7. Option characteristic curves of multigroup (male vs. female) IRT in fifth

grade.



Finally, Figure 4.8 illustrates the test characteristic curves for the two groups in

each grade level.

Figure 4.8. Test characteristic function curves of multigroup (male vs. female) IRT.
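For readers who wish to see how a test characteristic curve follows from the item estimates, the sketch below sums the expected item scores over the eight items using the fifth-grade values in Table 4.8. It assumes logistic boundary curves of the graded response type and a 0 to 3 scoring of the four response options; the function names are hypothetical and this is not the routine used to draw Figure 4.8.

```python
import math

def expected_item_score(theta, a, thresholds):
    """Expected 0-3 item score: the sum of the three boundary probabilities."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)

def tcc_expected_total(theta, items):
    """Expected total score (0-24 for eight items) at a given theta."""
    return sum(expected_item_score(theta, a, bs) for a, bs in items)

# Fifth-grade estimates from Table 4.8 as (a, (b1, b2, b3)), in the order
# SDQ6, SDQ12, SDQ16, SDQ22, SDQ26, SDQ30, SDQ36, SDQ41.
male = [
    (2.32, (-1.82, -0.74, 0.32)), (2.37, (-1.08, 0.11, 1.01)),
    (1.92, (-2.34, -0.97, 0.29)), (4.32, (-1.54, -0.61, 0.21)),
    (1.53, (-2.07, -0.64, 0.67)), (5.10, (-1.41, -0.68, 0.01)),
    (4.63, (-1.46, -0.60, 0.19)), (3.03, (-1.95, -0.88, 0.16)),
]
female = [
    (2.32, (-1.69, -0.47, 0.58)), (2.80, (-1.19, -0.05, 0.76)),
    (1.90, (-2.26, -0.91, 0.36)), (4.48, (-1.57, -0.60, 0.18)),
    (1.53, (-1.81, -0.30, 1.03)), (5.09, (-1.41, -0.69, 0.01)),
    (4.93, (-1.51, -0.61, 0.16)), (3.14, (-1.89, -0.75, 0.32)),
]

thetas = [x / 2 for x in range(-6, 7)]  # -3.0 to 3.0 in steps of 0.5
male_tcc = [tcc_expected_total(t, male) for t in thetas]
female_tcc = [tcc_expected_total(t, female) for t in thetas]
for t, m, f in zip(thetas, male_tcc, female_tcc):
    print(f"theta={t:+.1f}  male={m:5.2f}  female={f:5.2f}")
```

Where the two printed columns stay close together across theta, item-level DIF in opposite directions is largely cancelling at the scale level.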

DIF across Ethnicity

Table 4.9 summarizes the IRT estimates of each ethnic group in third grade.

For the third-grade White versus African American comparison (see Figure 4.9), no item

qualified as the anchor, so all items were constrained to be equal between groups. As

shown in Table 4.5, all items demonstrated significant DIF. In particular, the item

thresholds (i.e., b1, b2, and b3) of SDQ12, 26, 30, and 36 were consistently lower for

the focal group than the reference group, indicating that African American children

were more likely than their White counterparts to endorse higher-order options of

these items, while the item thresholds of SDQ16 were consistently lower for the White

group. For SDQ6, 22, and 41, the White group showed lower b1 and b2 but higher b3,

indicating that these items encompassed wider ranges along the θ continuum for White

respondents than African American ones. Adding to the complexity is that the SDQI

items on average showed much lower discrimination (Ma = 2.32, SDa = 0.87) among

African American respondents than among their White counterparts (Ma = 3.06, SDa =

1.38), resulting in the mostly nonuniform DIF among items (see Figure 4.9).



For the third-grade White versus Hispanic comparison (see Figure 4.10), no item

qualified as the anchor, so all items were constrained equal between groups. Again,

all eight items showed significant DIF. In particular, the item thresholds of SDQ12,

22, 30, and 36 were consistently lower for Hispanic respondents than White

respondents, while the item thresholds of SDQ16, 26, and 41 were consistently lower

for the White group. For SDQ6, the White group had lower b1 and b2 but higher b3.

As in the White versus African American comparison, the SDQI items on average had

lower discrimination (Ma = 2.66, SDa = 1.02) among Hispanic respondents than

among their White counterparts (Ma = 3.06, SDa = 1.38). As shown in Figure 4.10, the

DIF of items was mostly nonuniform.

For the third-grade White versus Asian American comparison (see Figure 4.11),

SDQ6, 22, and 30 were specified as the anchor items, and other items were freely

estimated between groups. Six of the eight items demonstrated significant DIF. In

particular, the item thresholds of SDQ12 and 36 were consistently lower for Asian

American respondents, while the item thresholds of SDQ16, 26, and 41 were

consistently lower for White respondents. For SDQ6, the White group had lower b1

and b2, but higher b3. Unlike in the previous comparisons, the items on average had a

similar level of discrimination (Ma = 3.14, SDa = 1.18) among Asian American

respondents as compared with their White counterparts (Ma = 3.06, SDa = 1.38). The

DIF of some items (e.g., SDQ26) still appeared to be nonuniform because of the joint

influence of DIF in a and b parameters.
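To illustrate what nonuniform DIF means in practice, the sketch below compares the expected item score on SDQ26 for the White and Asian American groups using the third-grade estimates in Table 4.9. It assumes logistic boundary curves, a 0 to 3 scoring of the response options, and that the tabled estimates can be compared directly on a common θ metric; the helper names are hypothetical.

```python
import math

def expected_item_score(theta, a, thresholds):
    """Expected 0-3 item score: the sum of the three boundary probabilities."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)

# Third-grade SDQ26 estimates from Table 4.9: (a, (b1, b2, b3)).
white = (1.13, (-2.16, -0.70, 0.62))
asian_american = (1.48, (-1.80, -0.37, 0.67))

def score_gap(theta):
    """White minus Asian American expected score at the same theta."""
    return (expected_item_score(theta, *white)
            - expected_item_score(theta, *asian_american))

grid = [x / 10 for x in range(-40, 41)]  # theta from -4.0 to 4.0
gaps = [score_gap(t) for t in grid]

# Nonuniform DIF: the gap changes sign, so neither group is favored everywhere.
curves_cross = min(gaps) < 0 < max(gaps)
print("gap at theta = -4:", round(score_gap(-4.0), 3))
print("gap at theta = +4:", round(score_gap(4.0), 3))
print("curves cross:", curves_cross)
```

With uniform DIF (equal a, shifted b) the gap would keep the same sign across the whole continuum; the sign change here comes from the difference in a combined with the differences in b.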



Table 4.9

Parameter Estimates of Multigroup (White vs. African American, White vs. Hispanic, and White vs. Asian American) IRT in Third Grade

Item       ai   s.e.   b1,i   s.e.   b2,i   s.e.   b3,i   s.e.
White
SDQ6     2.14   0.04  -1.77   0.03  -0.67   0.02   0.26   0.02
SDQ12    2.57   0.05  -1.14   0.02  -0.17   0.02   0.55   0.02
SDQ16    1.77   0.04  -2.53   0.06  -1.17   0.03   0.04   0.02
SDQ22    4.15   0.08  -1.50   0.02  -0.68   0.02   0.01   0.01
SDQ26    1.13   0.03  -2.16   0.06  -0.70   0.03   0.62   0.03
SDQ30    5.22   0.11  -1.34   0.02  -0.77   0.02  -0.21   0.01
SDQ36    4.76   0.09  -1.45   0.02  -0.72   0.02  -0.05   0.01
SDQ41    2.76   0.06  -1.98   0.04  -1.01   0.02  -0.01   0.02
African American^a
SDQ6     1.96   0.09  -1.50   0.07  -0.54   0.05   0.14   0.04
SDQ12    1.97   0.09  -1.21   0.06  -0.37   0.04   0.29   0.04
SDQ16    1.83   0.08  -1.85   0.09  -0.78   0.05   0.09   0.04
SDQ22    2.46   0.10  -1.47   0.07  -0.67   0.04  -0.03   0.04
SDQ26    0.71   0.05  -2.42   0.19  -0.92   0.10   0.50   0.08
SDQ30    3.80   0.19  -1.37   0.06  -0.79   0.04  -0.33   0.04
SDQ36    3.10   0.14  -1.52   0.07  -0.82   0.04  -0.21   0.04
SDQ41    2.70   0.13  -1.53   0.07  -0.74   0.05  -0.10   0.04
Hispanic^b
SDQ6     2.08   0.08  -1.68   0.06  -0.52   0.03   0.19   0.03
SDQ12    2.05   0.08  -1.31   0.05  -0.40   0.03   0.35   0.03
SDQ16    1.92   0.08  -1.94   0.08  -0.73   0.04   0.23   0.03
SDQ22    3.20   0.12  -1.52   0.05  -0.71   0.03  -0.02   0.03
SDQ26    1.01   0.05  -2.12   0.12  -0.62   0.06   0.64   0.05
SDQ30    4.31   0.18  -1.41   0.05  -0.81   0.03  -0.30   0.03
SDQ36    3.80   0.14  -1.46   0.05  -0.75   0.03  -0.14   0.03
SDQ41    2.90   0.11  -1.75   0.06  -0.75   0.03   0.06   0.03
Asian American^c
SDQ6     2.34   0.15  -1.64   0.10  -0.50   0.05   0.25   0.05
SDQ12    2.40   0.14  -1.31   0.08  -0.32   0.05   0.48   0.05
SDQ16    2.21   0.14  -1.77   0.12  -0.68   0.06   0.28   0.05
SDQ22    3.88   0.23  -1.55   0.08  -0.67   0.05   0.02   0.04
SDQ26    1.48   0.10  -1.80   0.14  -0.37   0.06   0.67   0.07
SDQ30    5.23   0.38  -1.40   0.07  -0.78   0.05  -0.24   0.04
SDQ36    4.32   0.27  -1.46   0.08  -0.76   0.05  -0.12   0.04
SDQ41    3.27   0.20  -1.70   0.10  -0.73   0.05   0.06   0.04

^a -2loglikelihood = 150540.10, AIC = 150668.10, BIC = 151130.48, G2(3676) = 23806.61, p < .0001, F0hat = 2.35, RMSEA = .02.

^b -2loglikelihood = 162351.31, AIC = 162479.31, BIC = 162946.26, G2(4014) = 25221.42, p < .0001, F0hat = 2.32, RMSEA = .02.

^c -2loglikelihood = 137068.92, AIC = 137196.92, BIC = 137653.07, G2(3268) = 20830.12, p < .0001, F0hat = 2.26, RMSEA = .02.



Figure 4.9. Option characteristic curves of multigroup (White vs. African American

[AA]) IRT in third grade.



Figure 4.10. Option characteristic curves of multigroup (White vs. Hispanic) IRT in

third grade.



Figure 4.11. Option characteristic curves of multigroup (White vs. Asian American

[Asian]) IRT in third grade.



Table 4.10 summarizes the IRT estimates of each ethnic group in fifth grade.

For the fifth-grade White versus African American comparison (see Figure 4.12), no item

qualified as the anchor, and all items showed significant DIF. In particular, the item

thresholds (i.e., b1, b2, and b3) of SDQ12, 22, and 36 were consistently lower for the

African American respondents, while the item thresholds of SDQ6 and 16 were

consistently lower for the White group. On the other hand, SDQ26, 30, and 41 had

lower b1 (and b2) but higher (b2 and) b3 for the reference group, indicating that these

items encompassed wider ranges of the θ continuum for White than African American

respondents. In addition, the eight items on average showed much lower

discrimination (Ma = 2.54, SDa = 0.89) among African American respondents as

compared with their White counterparts (Ma = 3.27, SDa = 1.37).

For the fifth-grade White versus Hispanic comparison (see Figure 4.13), no item

qualified as the anchor, and again all items showed significant DIF. In particular, the

item thresholds (b1, b2, b3) of SDQ12, 22, 30, and 36 were consistently lower for

Hispanic than White respondents, while the item thresholds of SDQ6, 16, 26, and 41

were consistently lower for the White group. In terms of item discrimination, the

SDQI items on average showed a similar level of discrimination (Ma = 3.10, SDa =

1.18) among Hispanic respondents as compared with their White counterparts (Ma =

3.27, SDa = 1.37).

For the fifth-grade White versus Asian American comparison (see Figure 4.14),

SDQ6, 22, 26, 30 and 36 were specified as the anchor items, and four items showed

significant DIF. In particular, the item thresholds (b1, b2, b3) of SDQ12 were

consistently lower for Asian than White respondents, while the item thresholds of

SDQ16, 26, and 41 were consistently lower for the White group. In terms of item

discrimination, the items on average showed similar levels of discrimination (Ma = 3.33,

SDa = 1.22) among Asian respondents as compared with their White counterparts (Ma

= 3.27, SDa = 1.37).



Table 4.10

Parameter Estimates of Multigroup (White vs. African American, White vs. Hispanic, and White vs. Asian American) IRT in Fifth Grade

Item       ai   s.e.   b1,i   s.e.   b2,i   s.e.   b3,i   s.e.
White
SDQ6     2.29   0.05  -1.76   0.04  -0.57   0.02   0.56   0.02
SDQ12    2.62   0.06  -0.98   0.02   0.23   0.02   1.12   0.02
SDQ16    1.87   0.04  -2.46   0.06  -1.04   0.03   0.36   0.02
SDQ22    4.63   0.10  -1.46   0.03  -0.49   0.02   0.35   0.02
SDQ26    1.58   0.04  -1.90   0.05  -0.43   0.02   0.95   0.03
SDQ30    5.22   0.11  -1.32   0.02  -0.56   0.02   0.18   0.02
SDQ36    5.03   0.11  -1.36   0.02  -0.47   0.02   0.36   0.02
SDQ41    2.99   0.07  -1.97   0.04  -0.82   0.02   0.34   0.02
African American^a
SDQ6     2.05   0.10  -1.44   0.08  -0.28   0.05   0.59   0.05
SDQ12    1.96   0.10  -1.12   0.07   0.10   0.05   1.01   0.06
SDQ16    1.72   0.09  -2.06   0.11  -0.61   0.06   0.48   0.06
SDQ22    3.37   0.18  -1.52   0.07  -0.51   0.05   0.21   0.05
SDQ26    1.14   0.06  -1.90   0.12  -0.43   0.07   0.86   0.08
SDQ30    3.76   0.19  -1.30   0.06  -0.62   0.05   0.04   0.04
SDQ36    3.46   0.17  -1.43   0.07  -0.54   0.05   0.19   0.05
SDQ41    2.91   0.15  -1.63   0.08  -0.55   0.05   0.31   0.05
Hispanic^b
SDQ6     2.29   0.09  -1.56   0.06  -0.37   0.04   0.64   0.04
SDQ12    2.36   0.09  -1.20   0.05   0.00   0.03   0.92   0.04
SDQ16    2.11   0.08  -1.81   0.07  -0.50   0.04   0.68   0.04
SDQ22    3.70   0.13  -1.56   0.05  -0.54   0.03   0.31   0.03
SDQ26    1.47   0.06  -1.73   0.08  -0.18   0.04   1.12   0.06
SDQ30    5.07   0.20  -1.38   0.05  -0.60   0.03   0.08   0.03
SDQ36    4.54   0.17  -1.53   0.05  -0.56   0.03   0.23   0.03
SDQ41    3.27   0.13  -1.69   0.06  -0.54   0.03   0.48   0.03
Asian American^c
SDQ6     2.50   0.16  -1.65   0.12  -0.47   0.06   0.61   0.06
SDQ12    2.43   0.16  -1.21   0.09   0.12   0.05   0.96   0.06
SDQ16    2.34   0.16  -1.89   0.14  -0.58   0.06   0.52   0.06
SDQ22    4.72   0.29  -1.47   0.09  -0.45   0.05   0.34   0.04
SDQ26    1.61   0.11  -1.64   0.14  -0.21   0.07   1.05   0.08
SDQ30    4.93   0.32  -1.35   0.08  -0.57   0.05   0.18   0.04
SDQ36    4.75   0.31  -1.48   0.09  -0.48   0.05   0.29   0.04
SDQ41    3.35   0.23  -1.74   0.12  -0.57   0.06   0.48   0.05

^a -2loglikelihood = 116548.76, AIC = 116676.76, BIC = 117139.13, G2(2725) = 23693.87, p < .0001, F0hat = 2.34, RMSEA = .03.

^b -2loglikelihood = 128842.10, AIC = 128970.10, BIC = 129437.05, G2(3017) = 24021.21, p < .0001, F0hat = 2.20, RMSEA = .03.

^c -2loglikelihood = 108189.86, AIC = 108317.86, BIC = 108774.01, G2(2426) = 19911.69, p < .0001, F0hat = 2.16, RMSEA = .03.



Figure 4.12. Option characteristic curves of multigroup (White vs. African American)

IRT in fifth grade.



Figure 4.13. Option characteristic curves of multigroup (White vs. Hispanic) IRT in

fifth grade.



Figure 4.14. Option characteristic curves of multigroup (White vs. Asian American

[Asian]) IRT in fifth grade.



Finally, Figure 4.15 illustrates the test characteristic curves for the two-group

comparisons in each grade level.

Figure 4.15. Test characteristic function curves of multigroup (White vs. African

American, White vs. Hispanic, and White vs. Asian American) IRT.



Item Parameter Drift

To detect item parameter drift (IPD) from third to fifth grade, the fifth-grade

full sample data were contrasted with the third-grade full sample data using the

sandwich estimator for standard errors. Table 4.11 summarizes the IRT estimates of

each wave of measurement.

Table 4.11

Parameter Estimates of Item Parameter Drift Analysis

Item       ai   s.e.   b1,i   s.e.   b2,i   s.e.   b3,i   s.e.
Third Grade
SDQ6     2.07   0.04  -1.80   0.04  -0.69   0.03   0.16   0.02
SDQ12    2.33   0.04  -1.27   0.03  -0.32   0.02   0.41   0.02
SDQ16    1.77   0.03  -2.36   0.04  -1.08   0.03   0.04   0.02
SDQ22    3.51   0.07  -1.59   0.03  -0.76   0.03  -0.07   0.02
SDQ26    1.04   0.02  -2.25   0.05  -0.75   0.03   0.57   0.03
SDQ30    4.81   0.13  -1.44   0.03  -0.86   0.03  -0.32   0.02
SDQ36    4.23   0.10  -1.55   0.03  -0.81   0.03  -0.16   0.02
SDQ41    2.71   0.05  -1.96   0.03  -0.99   0.03  -0.08   0.02
Fifth Grade
SDQ6     2.50   0.62  -1.85   0.17  -0.79   0.06   0.18   0.26
SDQ12    2.72   0.72  -1.30   0.05  -0.20   0.18   0.59   0.33
SDQ16    2.07   0.50  -2.36   0.29  -1.11   0.02   0.07   0.24
SDQ22    4.74   1.46  -1.68   0.12  -0.79   0.07  -0.05   0.23
SDQ26    1.65   0.34  -2.01   0.20  -0.66   0.08   0.54   0.30
SDQ30    5.50   2.08  -1.54   0.10  -0.87   0.06  -0.22   0.20
SDQ36    5.13   1.71  -1.61   0.11  -0.79   0.07  -0.07   0.23
SDQ41    3.33   0.63  -2.01   0.19  -0.98   0.04  -0.01   0.22


54235.48, p < .0001, F0hat = 1.85, RMSEA = .01.

For the IPD analysis (see Figures 4.16 and 4.17), no item qualified as the

anchor, and all items showed significant DIF (see Table 4.5). All items showed higher

discrimination in fifth grade, resulting in a drift of the a parameter from Ma = 2.81

(SDa = 1.20) in third grade to Ma = 3.46 (SDa = 1.38) in fifth grade. In terms of item



thresholds, all items except SDQ26 had lower b1 (and b2) and higher (b2 and) b3 in

fifth grade, indicating that most of the items encompassed wider ranges of the θ

continuum for fifth graders than for third graders. SDQ26 stood out because it actually covered a

narrower range of the continuum when administered in fifth grade. On average, b1

drifted from -1.78 (SDb1 = 0.36) in third grade to -1.80 (SDb1 = 0.31) in fifth grade, b2

from -0.78 (SDb2 = 0.21) to -0.77 (SDb2 = 0.25), and b3 from 0.07 (SDb3 = 0.28) to

0.13 (SDb3 = 0.27). Clearly, the nonuniform drifts of b parameters together with the

drift of a parameter resulted in nonuniform DIF in all items (see Figure 4.16).
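The averages reported above can be recomputed directly from the estimates in Table 4.11. The sketch below does so with the Python standard library; it uses the population standard deviation (dividing by n = 8), which is an assumption on our part that happens to reproduce the reported SDa values of 1.20 and 1.38, whereas the sample formula would give larger values.

```python
from statistics import mean, pstdev

# Item parameter estimates from Table 4.11, in the order
# SDQ6, SDQ12, SDQ16, SDQ22, SDQ26, SDQ30, SDQ36, SDQ41.
third = {
    "a":  [2.07, 2.33, 1.77, 3.51, 1.04, 4.81, 4.23, 2.71],
    "b1": [-1.80, -1.27, -2.36, -1.59, -2.25, -1.44, -1.55, -1.96],
    "b2": [-0.69, -0.32, -1.08, -0.76, -0.75, -0.86, -0.81, -0.99],
    "b3": [0.16, 0.41, 0.04, -0.07, 0.57, -0.32, -0.16, -0.08],
}
fifth = {
    "a":  [2.50, 2.72, 2.07, 4.74, 1.65, 5.50, 5.13, 3.33],
    "b1": [-1.85, -1.30, -2.36, -1.68, -2.01, -1.54, -1.61, -2.01],
    "b2": [-0.79, -0.20, -1.11, -0.79, -0.66, -0.87, -0.79, -0.98],
    "b3": [0.18, 0.59, 0.07, -0.05, 0.54, -0.22, -0.07, -0.01],
}

def summarize(estimates):
    """Return {parameter: (mean, population SD)} across the eight items."""
    return {name: (mean(vals), pstdev(vals)) for name, vals in estimates.items()}

summary_third = summarize(third)
summary_fifth = summarize(fifth)
for name in ("a", "b1", "b2", "b3"):
    m3, s3 = summary_third[name]
    m5, s5 = summary_fifth[name]
    print(f"{name}: third M={m3:.2f} SD={s3:.2f} | fifth M={m5:.2f} SD={s5:.2f}")
```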



Figure 4.16. Option characteristic curves of item parameter drift (third- vs. fifth-

grade).



Figure 4.17. Test characteristic function curves of item parameter drift.



CHAPTER V

DISCUSSION

The Psychometric Properties of the SDQI Mathematics Subscale

The Classical Test Theory Perspective

In this dissertation, the psychometric properties of the SDQI mathematics

subscale were examined from the classical test theory (CTT), the factor analytic, and

the item response theory (IRT) perspectives to enhance our understanding of

academic interest and affect, particularly in the mathematics domain. To address this,

an important consideration was how the SDQI, an instrument initially developed based

on the framework of academic self-concept (Shavelson & Bolus, 1981; Shavelson et

al., 1976), could be evaluated as a measure of interest. As discussed in Chapter 3,

self-concept is a broad concept that refers to organized, multifaceted perception of

one’s self (Shavelson et al., 1976). This multifaceted nature of self-concept allows

researchers to examine a single aspect of it (e.g., interest/affect) without disregarding

the conceptualization of self-concept. The ECLS-K adapted SDQI mathematics scale

consists of eight items, four of which are affective. In line with the

operationalization of academic interest as discussed in Chapter 2, the four affective

items either assess interest directly (SDQ22—“I am interested in math”) or reflect

certain aspects of interest such as motivation (SDQ12—“I cannot wait to do

math each day”), liking (SDQ30—“I like math”), and positive emotion (SDQ36—“I

enjoy doing work in math”). In search of a good measure of academic interest, the

SDQI affective items appear to offer adequate face validity for the relevant investigations.

Another important consideration was, therefore, whether the four cognitive

items (i.e., SDQ6, 16, 26, and 41) should also be included in the investigation. In

terms of face validity, these cognitive items actually focus on perceived competence

(e.g., SDQ6—“Work in math is easy for me”) which plays a central role in the

formation of academic self-concept (Bong & Skaalvik, 2003). The inclusion of these

items could have confounded the investigation if they happened to underlie a



distinguishable dimension other than interest and affect. Because the literature has

long debated whether the affective domain is separable from the cognitive domain

(Bong & Skaalvik, 2003), a widely acknowledged theoretical basis for the

investigations of this study was unlikely to be available. Hence, factor

analysis was conducted first to provide an empirical basis for the subsequent CTT and

IRT analyses.

The results of factor analysis indicate a relatively loose one-factor structure of

the eight SDQI items. Using the eigenvalue > 1 criterion, one single factor was

extracted as a result of exploratory factor analysis (EFA). This one-factor structure,

however, was not strongly supported by the confirmatory factor analysis (CFA).

Although EFA and CFA often rely on the same estimation methods, they also differ in

the manner by which cross-loadings are handled. More importantly, the CFA

framework offers researchers the ability to specify the nature of relationships among

the measurement errors of the items (Brown, 2006). With regard to the current

investigation, these features of CFA offer at least two explanations for the relatively

poor fit of the one-factor structure. First, it is possible that the eight items tended to

load on multiple factors (e.g., affective vs. cognitive). In this case, poor fit might have

emerged in CFA because the identification restrictions associated with CFA are

achieved in part by fixing the cross-loadings to zero (Brown, 2006). Second, the poor

fit might be caused by correlated errors among items, and the model fit could have

been notably improved once these errors were allowed to correlate. In fact, the

modification indices did suggest very large correlated errors among the SDQI items.

However, correlated errors were not modeled in the current study to avoid model

overfitting in the absence of a theoretical basis. Likewise, multiple-factor solutions

were not tested in CFA because the EFA provided no empirical basis for them.

From a theoretical perspective, this loose one-factor solution may be a function

of the conceptual fuzziness surrounding the concepts of interest, affect, and perceived

competence. Specifically, perceived competence is closely related to academic

interest and even considered to have a reciprocal relationship with it (Marsh et al.,



2005; Wigfield & Eccles, 2002). As such, perceived competence may act as an

indicator of interest and affect, and vice versa. This may bring challenges to the

current investigation because the scale may be interpreted as either a measure of

interest with perceived competence items as its cognitive indicators, or a measure of

perceived competence with interest, liking, and positive affect as its affective

indicators. The criterion-related validity of the scale, as established in this study, does

not seem to address this issue either, given that both perceived competence/self-

efficacy and interest have been identified as positive predictors of academic

performance (Hidi, 2000; Hidi & Harackiewicz, 2000; Klassen & Usher, 2010).

Hence, the item-level characteristics can be particularly helpful for us to gain a better

understanding of the scale.

The item-level characteristics as revealed by the CTT method indicate that the

eight items are of similarly good quality. Among them, SDQ26 (“I can do very

difficult problems in math”) appears to be weaker than others as assessed by the item-

total correlations as well as Cronbach’s α if item deleted (see Table 4.4). Four items

demonstrate higher item-total correlations across grade levels: SDQ22 (“I am

interested in math”), SDQ30 (“I like math”), SDQ36 (“I enjoy doing work in math”),

and SDQ41 (“I am good at math”). In the factor analytic framework, the four items

also show higher factor loadings (.84 to .93 in third grade; .88 to .93 in fifth grade)

than others (.51 to .79 in third grade; .66 to .81 in fifth grade). It would be premature,

though, to conclude that this set of items represents the key concept of the SDQI, as they

do not surpass others (i.e., SDQ6, SDQ12, and SDQ16) much in their CTT-based

psychometric qualities (see Table 4.4).

From the CTT perspective, the SDQI mathematics subscale demonstrates

sufficient internal reliability, predictive validity, and item-level qualities; however, the

CTT-based evaluation does not adequately address the dimensionality issue of the

scale, particularly whether the scale is primarily measuring perceived competence or

interest. The item response theory (IRT) based evaluation was thus critical.



The Item Response Theory Perspective

With many advantages over the CTT, the IRT-based evaluation largely

strengthens the item-level findings revealed by the CTT-based method. In the IRT

framework, the same set of items (i.e., SDQ22, 30, 36, and 41) tends to demonstrate

better psychometric qualities than others as indicated by their item discriminations (a).

Specifically, SDQ22, 30, and 36 all have a values above 3 in third grade and a values

above 4 in fifth grade. Because the item information function is proportional to item

discrimination (see Equation 3), these three items offer relatively rich information in

assessing the individual’s latent trait. As shown in Figure 4.4, the areas under the

curve (AUCs) of the other items (i.e., SDQ6, 12, 16, and 26) appear to be limited,

while the AUC of SDQ41 is somewhere between the two sets. Hence, the items

offering the most information in the scale involve explicit expression of interest (“I am

interested in math”), liking (“I like math”), and positive affect (“I enjoy doing work in

math”), adding to the evidence that the scale is primarily measuring interest and

associated affect while including perceived competence items as cognitive indicators

of interest. Although perceived competence or self-efficacy is theorized to play a

central role in self-concept (Bong & Skaalvik, 2003; Marsh, 1992a), findings of the

current investigation suggest instead that interest and affect play the more central role. Arguably, more

attention needs to be directed at the role of interest and affect in the formation of

academic self-concept or other similar constructs.
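The contrast in item information described above can be approximated numerically. The sketch below computes the graded response model item information, in which discrimination enters as a squared factor, and integrates it over θ for SDQ30 and SDQ26. It borrows the third-grade estimates from Table 4.11 rather than the calibration behind Figure 4.4, assumes logistic boundary curves, and uses hypothetical function names, so the values are illustrative only.

```python
import math

def boundary(theta, a, b):
    """P(response in category k or higher) under a logistic boundary curve."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, thresholds):
    """Graded response model item information at theta."""
    # Pad the boundary curves with P(X >= lowest) = 1 and P(X > highest) = 0.
    p_star = [1.0] + [boundary(theta, a, b) for b in thresholds] + [0.0]
    info = 0.0
    for p_here, p_next in zip(p_star, p_star[1:]):
        category_prob = p_here - p_next
        slope = a * (p_here * (1 - p_here) - p_next * (1 - p_next))
        if category_prob > 0:  # guard against underflow at extreme theta
            info += slope ** 2 / category_prob
    return info

def area_under_information(a, thresholds, lo=-4.0, hi=4.0, steps=800):
    """Trapezoidal approximation of the area under the information curve."""
    width = (hi - lo) / steps
    values = [item_information(lo + i * width, a, thresholds)
              for i in range(steps + 1)]
    return width * (sum(values) - 0.5 * (values[0] + values[-1]))

# Third-grade estimates from Table 4.11: (a, (b1, b2, b3)).
sdq30 = (4.81, (-1.44, -0.86, -0.32))
sdq26 = (1.04, (-2.25, -0.75, 0.57))

auc30 = area_under_information(*sdq30)
auc26 = area_under_information(*sdq26)
print("AUC SDQ30:", round(auc30, 2))
print("AUC SDQ26:", round(auc26, 2))
```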

While the SDQI items in general show “high” to “very high” item

discrimination (Baker, 1985, 2001), their item thresholds (b) actually reflect the

inadequacy of the scale, particularly for younger children. It is worth mentioning that

there are three versions of the SDQ: SDQI for pre-adolescents (age 5-12; Marsh,

1992a), SDQII for early adolescents (age 13-17; Marsh, 1992b) and SDQIII for late

adolescents and adults (age 16 and over; Marsh, 1992c). That is, the development of

the SDQ has taken into account children’s developmental stages, and the SDQI was

designed especially for elementary school age children. With regard to the

mathematics domain, the wording of the SDQI items is simpler and more



straightforward (e.g., “I like mathematics”) as compared with items in SDQII (e.g., “I

look forward to mathematics classes”) and SDQIII (e.g., “I find many mathematical

problems interesting and challenging”). Moreover, respondents are given a 5-point

scale in the original SDQI (1-false to 5-true), a 6-point scale in the SDQII (1-false to

6-true), and an 8-point scale in the SDQIII (1-definitely false to 8-definitely true). The

ECLS-K adapted SDQI further reduced the number of ordered options to 4 (1-not at

all true to 4-very true), but the current investigation reveals that this adapted version

may still be age-inappropriate among elementary school children, particularly third

graders. The items tend to have very low average thresholds among third graders,

meaning that only very low levels of mathematics interest are necessary for a third grader to

endorse response option k or higher rather than the lower-order options (0 through k

– 1). For instance, a randomly selected third grader has a probability of 0.5 of endorsing

very true for SDQ30 (“I like math”) at θ = -0.32, meaning that an individual does

not even need to have an above-average level of interest to respond that he or she likes math.

In the third-grade calibration, four of the eight items have b3 values that are below zero

(i.e., SDQ22, 30, 36, and 41), which illustrates the inadequacy of the SDQI items in

assessing third graders’ interest levels.
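The interpretation of b3 as the point at which endorsing the highest option becomes a 50-50 proposition can be made concrete with a short sketch. It assumes the logistic graded response form and uses the third-grade SDQ30 estimates from Table 4.11, whose b3 matches the -0.32 cited above; the function names are hypothetical.

```python
import math

def prob_at_or_above(theta, a, b_k):
    """P(response in option k or higher) for a boundary with threshold b_k."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b_k)))

def category_probabilities(theta, a, thresholds):
    """Probabilities of the four ordered options, lowest to highest."""
    p_star = [1.0] + [prob_at_or_above(theta, a, b) for b in thresholds] + [0.0]
    return [p_here - p_next for p_here, p_next in zip(p_star, p_star[1:])]

# Third-grade SDQ30 ("I like math") estimates from Table 4.11.
a, thresholds = 4.81, (-1.44, -0.86, -0.32)

# At theta equal to b3, the probability of choosing "very true" is exactly 0.5.
p_very_true_at_b3 = prob_at_or_above(thresholds[2], a, thresholds[2])

# At an average trait level (theta = 0) "very true" is already the modal choice.
probs_at_mean = category_probabilities(0.0, a, thresholds)

print("P(very true | theta = b3):", p_very_true_at_b3)
print("option probabilities at theta = 0:", [round(p, 3) for p in probs_at_mean])
```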

The adapted SDQI tends to perform better among fifth graders, in that the average

thresholds (i.e., b1, b2, and b3) are higher than those among third graders. Specifically,

it now takes a θ = 0.09 in fifth grade, rather than -0.32 in third grade, for a randomly

selected child to have a probability of 0.5 of endorsing very true for SDQ30 (“I like

math”). Nonetheless, the b3 values of SDQ22, 30, 36, and 41 still appear to be

lower than those of the other items. Furthermore, none of the eight items has a b3 that goes beyond 1.

Although the SDQI performs better among fifth graders, the scale still appears

to be inadequate in capturing much information about children who are located at the

medium-to-far right side (θ > 1) of the continuum (see Figure 4.5).

Another important finding is that the affective items (i.e., SDQ22, 30, and 36),

though offering the most information across all SDQI items, appear to have the lowest

thresholds among all items as well. That is, the three affective items may perform



well in distinguishing those who are of higher interest levels from those of lower

levels, but such high discrimination only functions within a limited range of the θ

continuum (i.e., θ < 0 for third grade; θ < 0.3 for fifth grade). Comparatively,

cognitive, perceived competence items such as SDQ6 (“Work in math is easy for me”)

are of lower discrimination but higher item thresholds. It is not counterintuitive

though that perceived competence items in general entail higher thresholds than

affective items, because even very young children (e.g., first graders) may have

developed domain-specific ability beliefs from their school experiences and peer

comparisons (Eccles et al., 1983, 1989).

Summary

The first research question of this dissertation concerns the psychometric

properties of the ECLS-K adapted SDQI in measuring mathematics interest and affect

from both the CTT and IRT perspectives. It was hypothesized that the SDQI would

show sufficient psychometric properties for the sample as a whole across two

measurement occasions (i.e., third- and fifth-grade). This hypothesis is mostly

supported by the CTT results, but only partly supported by the IRT results. From the

CTT perspective, the scale shows good (α = .89) to excellent (α = .92) internal

consistency reliability, meaning that the items are consistent among themselves and

with the scale as a whole (Gay et al., 2008). Although SDQ26 is of relatively low

item-total correlations across waves, these correlations are still moderate to large in

magnitude (.42 and .59). Thus, the CTT analysis concludes that the SDQI

mathematics subscale possesses sufficient psychometric properties for the sample as a

whole across waves. The IRT-based evaluation, however, reveals that the SDQI needs

to be reconsidered for its age appropriateness for elementary school children. The

findings of the current investigation suggest that the SDQI performs poorly among

third graders, and tends to perform better when these children advance to fifth grade,

but still not sufficiently well because the item thresholds are lower than we expect

them to be. Future research may investigate specific reasons for this age



inappropriateness of the SDQI items, especially the cultural influence as the SDQI was

initially normed based on an Australian sample back in the 1990s (Marsh, 1992a).

Measurement Bias of the SDQI Mathematics Subscale across Gender

The second research question of this dissertation concerns the measurement

bias of the SDQI items across gender, and it was hypothesized that gender differences

would emerge at item level as revealed by differential item functioning (DIF) analyses

in an IRT framework. Results of the IRT-based DIF analyses in general support this

hypothesis, but the findings also indicate some developmental changes in such

measurement bias. Specifically, all items (100%) in the scale demonstrate

measurement bias across gender in third grade but only five of them (62.5%) show

significant DIF in fifth grade.

A key finding is that the item-level measurement bias tends to be domain-

specific across waves: cognitive, perceived competence items tend to be in favor of

boys while affective, interest-related items tend to be in favor of girls. That is, a boy

is more likely to endorse higher-order options of perceived competence items than a

girl, even given the same level of mathematics interest. Conversely, a girl is more

likely to endorse higher-order options of affective items than a boy. Thus, the DIF of

the eight items appears to be balanced in the sense that half of them are in favor of one

group while the other half are in favor of the other. This domain specificity of DIF is evident in fifth

grade as well, except that three of the items have become invariant across gender by

this age. The three items include SDQ16 (“I get good grades in math”), SDQ22 (“I

am interested in math”), and SDQ30 (“I like math”). In other words, a fifth-grade boy

and a fifth-grade girl tend to respond to the three items in the same manner, despite the

fact that they might have responded to them differently in third grade.

Such developmental changes are also evident in the invariance of item

discrimination (a). In general, the SDQI items show higher discrimination for girls

than boys. For instance, the average item discrimination is 3.00 (SD = 1.33) for girls

and 2.75 (SD = 1.14) for boys in third grade. In fifth grade, this gap becomes



narrower, with the average item discrimination at 3.27 (SD = 1.30) for girls

and 3.15 (SD = 1.27) for boys. A relevant finding is that nonuniform DIF is observed

in more items in third grade than in fifth grade. The fifth-grade DIF is mostly uniform,

being a function of the DIF in b parameters alone. Taken together, it can be concluded

that the SDQI mathematics subscale tends to become more invariant across gender

over elementary school years (i.e., third- to fifth-grade).

Some of these findings may be explained by the existing literature. For

example, the domain-specific measurement bias may be accounted for by expectancy-

value theory-based studies (e.g., Eccles et al., 1983, 1989; Wigfield & Eccles, 2000).

Eccles and colleagues’ work has focused on the development of expectancy and value

beliefs in children and adolescents. It was found that children begin to form clearly distinct

ability-expectancy beliefs as early as first grade within the domains of mathematics,

reading, music, and sports. More importantly, their research has shown that boys’ and

girls’ beliefs and values differ in gender stereotypic ways. Such gender stereotypic

beliefs plausibly serve as the underlying mechanism for the measurement bias of the

SDQI cognitive items, especially given the domain of mathematics (Bhana, 2005). On

the other hand, it is unclear why three of the items showing bias in third grade (i.e.,

SDQ16, 22, and 30) would become invariant in the fifth grade. It seems impossible to

summarize the common characteristics of this set of items by examining the wording

of them. For instance, it is difficult to explain why fifth-grade boys and girls respond

to SDQ16 (“I get good grades in math”) in the same manner, but respond to SDQ41

(“I am good at math”) differently. Furthermore, limited by only two waves of

measurement, the current investigation is incapable of extending its inferences beyond

elementary school. For example, will measurement bias across gender keep

diminishing as children advance to middle school? Future research may extend such

investigations to adolescence or even adulthood from a longitudinal perspective.

Measurement Bias of the SDQI Mathematics Subscale across Ethnicity

The third research question of this dissertation concerns the measurement bias

of the SDQI items across ethnicity, and it was hypothesized that ethnic differences



would emerge at item level as revealed by differential item functioning (DIF) analyses

in an IRT framework. Results of the IRT-based DIF analyses in general support this

hypothesis, but the findings are fairly complicated because three comparisons (i.e.,

White vs. African American, White vs. Hispanic, and White vs. Asian) were

conducted in each wave. These comparisons do not necessarily tell a single story

about this research question.

The DIF between White and African American respondents appears to be the

most substantial among all comparisons (see Table 4.6), and the pattern of the DIF is

also the most complicated. Overall, the SDQI items do not have as high

discrimination for African American children as for White children, so the OCCs of

most items appear to be flatter for African American respondents (see Figures 4.9 and

4.12). In terms of the invariance in item thresholds, no clear pattern can be identified

across waves. In third grade, African American children are more likely to endorse

higher-order options of SDQ12 (“I cannot wait to do math each day”), SDQ26 (“I can

do very difficult problems in math”), SDQ30 (“I like math”), and SDQ36 (“I enjoy

doing work in math”), whereas White children are more likely to endorse higher-order

options of SDQ16 (“I get good grades in math”). The DIF does not follow the

cognitive-affective distinction as identified in the DIF analysis across gender. To

further complicate this situation, the DIF in items becomes different in fifth grade:

SDQ12, 22, and 30 are in favor of the African American group, while SDQ6 and 16

are in favor of the White group. The underlying mechanism is that even the DIF

across b parameters (i.e., b1, b2, and b3) of the same item is inconsistent, meaning that

some items may have lower b1 (and b2) as well as higher (b2 and) b3 for White

respondents. In general, the items encompass narrower ranges of the θ continuum for

African American children. These findings call for caution in future use of the SDQI among

the African American population because, in addition to showing measurement bias

between White and African American children, the items also appear to have poorer

psychometric qualities among African Americans. For instance, the item

discrimination is as low as 0.71 for SDQ26 for African Americans in third grade,



causing this item to be quite incapable of distinguishing those with high interest levels

from those with low levels.
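
The effect of low discrimination described above can be sketched numerically. The following is a minimal illustration only, assuming Samejima's graded response model for the four-option items; the thresholds and the comparison value a = 2.0 are hypothetical and are not estimates from this study. Only a = 0.71 is taken from the text. The sketch compares how far the expected item score moves between a low-interest and a high-interest respondent under each discrimination value.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities under a graded response model (assumed here)."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    # Cumulative curves P(X >= k | theta) for k = 1..K-1
    cum = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    upper = np.hstack([np.ones((theta.size, 1)), cum])
    lower = np.hstack([cum, np.zeros((theta.size, 1))])
    return upper - lower  # shape: (n_theta, K)

def expected_score(theta, a, b):
    """Expected item score (options scored 0..K-1) at each theta."""
    probs = grm_category_probs(theta, a, b)
    return probs @ np.arange(probs.shape[1])

theta = np.array([-1.0, 1.0])   # a low- and a high-interest respondent
b = [-1.5, -0.5, 0.5]           # hypothetical thresholds b1, b2, b3

low = expected_score(theta, 0.71, b)   # a = 0.71 as reported for SDQ26
high = expected_score(theta, 2.0, b)   # hypothetical comparison value

gap_low = low[1] - low[0]     # score separation with low discrimination
gap_high = high[1] - high[0]  # score separation with high discrimination
print(f"separation at a=0.71: {gap_low:.2f}; at a=2.0: {gap_high:.2f}")
```

Under these assumed values, the same two respondents are separated by a smaller difference in expected score when a is low, which is the sense in which a flatter curve distinguishes interest levels less well.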

The DIF between White and Hispanic respondents is less substantial and less

complicated. Although all items show significant DIF between groups, the DIF

clearly follows the cognitive-affective distinction that Hispanic children are more

likely to endorse higher-order options of affective items, while White children are

more likely to endorse higher-order options of cognitive, perceived competence items.

As compared with their White counterparts, Hispanic students are more likely to

represent their interest for mathematics by endorsing direct, affective statements such

as “I like math,” and less likely to self-report their interest levels through perceived

competence items. From a developmental perspective, it is also worth noting that the

DIF between groups tends to become even less substantial in fifth grade, particularly

in the DIF of item discrimination (a). The gap in average item

discrimination between the White and Hispanic groups tends to close in fifth grade, and as

a result, the DIF turns from mostly nonuniform to mostly uniform from third to fifth

grade (see Figures 4.10 and 4.13).
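
The link between the discrimination gap and the uniform versus nonuniform character of the DIF can be illustrated with a small sketch. It assumes a graded response model and uses expected item score curves in place of the individual OCCs; all parameter values are hypothetical. When two groups share the same a and differ only by a shift in thresholds, the curves never cross (uniform DIF); when a also differs, the curves cross at some point on the θ continuum (nonuniform DIF).

```python
import numpy as np

def expected_item_score(theta, a, b):
    """Expected score of a graded item: sum of cumulative curves P(X >= k)."""
    theta = np.asarray(theta, dtype=float)
    b = np.asarray(b, dtype=float)
    cum = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    return cum.sum(axis=1)

def curves_cross(theta, ref, focal):
    """True if the two groups' expected-score curves change order over theta."""
    diff = expected_item_score(theta, *ref) - expected_item_score(theta, *focal)
    signs = np.sign(diff[np.abs(diff) > 1e-9])
    return bool(np.any(signs[:-1] != signs[1:]))

theta = np.linspace(-4, 4, 161)

# Hypothetical (a, [b1, b2, b3]) values for illustration only
reference = (1.8, [-1.0, 0.0, 1.0])
focal_uniform = (1.8, [-0.7, 0.3, 1.3])     # same a, shifted thresholds
focal_nonuniform = (1.0, [-0.7, 0.3, 1.3])  # lower a as well

uniform_crosses = curves_cross(theta, reference, focal_uniform)
nonuniform_crosses = curves_cross(theta, reference, focal_nonuniform)
print("equal a, curves cross:", uniform_crosses)
print("unequal a, curves cross:", nonuniform_crosses)
```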

The DIF between White and Asian American respondents is the least

substantial among all comparisons. In particular, SDQ22 (“I am interested in math”)

and 30 (“I like math”) are invariant across grade levels, and SDQ36 (“I enjoy doing

work in math”) also becomes invariant in fifth grade. Thus, the items offering the most

information appear to measure White and Asian children’s mathematics interest

equivalently. For the remaining items with measurement bias, the DIF also appears to

follow the affective-cognitive distinction. As in the White versus Hispanic

comparison, Asian American children are more likely to endorse affective items for

representing their mathematics interest. In addition, the average item discrimination is

similar, and even slightly higher, for Asian American than for White children across grade

levels.

Although results of these comparisons are mixed, some key findings may still

be summarized with respect to the measurement bias of the SDQI items across ethnic



groups. First, most of the SDQI mathematics items (50% to 100%) show certain

levels of measurement bias according to ethnic background of the respondent. Mostly,

the DIF reflects the joint influence of noninvariance in item discrimination (a)

and item thresholds (b1, b2, and b3), and is thus nonuniform. Second, although there

exists substantial DIF between White and any of the ethnic minorities, assuming

measurement equivalence can be most problematic for any group comparison that

involves the African American population. When the investigations involve the

Hispanic or Asian population, researchers need to bear in mind that they tend to

respond to perceived competence items and affective items differently as compared

with White people. Third, the affective-cognitive distinction seems to play an

important role in the White versus Hispanic and White versus Asian comparisons,

though the DIF of the former appears to be more substantial. It is unclear whether

stereotype threat (Ogbu, 2003; Steele & Aronson, 1995) is a contributing factor of this

phenomenon because, unlike African American and Hispanic students, Asian

American students tend to be stereotyped for their excellence in STEM fields (General

Accounting Office, 2005; Goyette & Xie, 1999; Herrara & Hurtado, 2011). Fourth,

there seems to be a trend for the DIF to diminish over time in all comparisons, which

requires further investigation from a longitudinal perspective. Last but not least,

as indicated by the test characteristic curves of each comparison, item-level DIF in

opposite directions tends to cancel out, so that the test characteristic curves in most

cases appear to coincide (see Figures 4.8, 4.15, and 4.17). As such, the true scores

remain close enough between the two groups to be trusted for most analyses of group

differences.

Item Parameter Drift of the SDQI Mathematics Subscale

The fourth research question of this dissertation concerns the measurement

bias of the SDQI items across age groups, and it was hypothesized that item

differences by age group would emerge at the item level as revealed by differential

item functioning analyses in an IRT framework. The DIF results support this

hypothesis: all items demonstrate significant DIF across age groups. As discussed



earlier, the scale is of better psychometric qualities for assessing fifth graders’ than

third graders’ mathematics interest. The overall improvement of the scale may partly

be accounted for by the item parameter drift (IPD) findings of this study. As children transition from third

to fifth grade, they respond more sensitively to the option anchors (e.g., not at all true,

very true) of the SDQI items. For instance, it takes a randomly selected fifth grader a

lower θ than a third grader to have the same probability of endorsing response Option 0

(not at all true), indicating that this anchor performs better in identifying children of

truly low interest levels in fifth grade. Likewise, it takes a randomly selected fifth

grader a higher θ than a third grader to have the same probability of endorsing response

Option 3 (very true) of an item, indicating that this anchor performs better in fifth

grade in identifying children of relatively high interest levels. This average pattern of

drift in b parameters, however, is not large enough in magnitude to result in any

notable improvement of the SDQI at the item level or as a whole. As shown in Figures

4.16 and 4.17, most of the OCCs still intersect at some point, indicating nonuniform DIF

at item level, and the test characteristic curves of third- and fifth-grade data nearly

coincide with each other.
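
The drift pattern described above can be made concrete with a short sketch. It assumes a graded response model, in which the probability of the lowest option is 1 minus the first cumulative curve and the probability of the highest option is the last cumulative curve, so the θ needed for a given probability follows directly from b1 or b3. The parameter values for both grades are hypothetical and chosen only to show thresholds spreading outward from third to fifth grade.

```python
import math

def theta_for_lowest_option(p, a, b1):
    """Theta at which P(X = 0) equals p, where P(X = 0) = 1 - logistic(a(theta - b1))."""
    return b1 - math.log(p / (1 - p)) / a

def theta_for_highest_option(p, a, b_last):
    """Theta at which P(X = K-1) equals p, where P(X = K-1) = logistic(a(theta - b_last))."""
    return b_last + math.log(p / (1 - p)) / a

# Hypothetical item parameters (not estimates from this study)
grade3 = {"a": 1.5, "b": [-1.2, -0.6, 0.0]}
grade5 = {"a": 1.5, "b": [-1.6, -0.6, 0.4]}

p = 0.5  # target endorsement probability
low3 = theta_for_lowest_option(p, grade3["a"], grade3["b"][0])
low5 = theta_for_lowest_option(p, grade5["a"], grade5["b"][0])
high3 = theta_for_highest_option(p, grade3["a"], grade3["b"][-1])
high5 = theta_for_highest_option(p, grade5["a"], grade5["b"][-1])

print(f"theta for Option 0 at p={p}: grade 3 = {low3:.2f}, grade 5 = {low5:.2f}")
print(f"theta for Option 3 at p={p}: grade 3 = {high3:.2f}, grade 5 = {high5:.2f}")
```

With a lower b1 and a higher b3 in fifth grade, the lowest option requires a lower θ and the highest option a higher θ for the same endorsement probability, which is the direction of drift reported in the text.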

Conclusion

The fifth research question of this dissertation entails a summary of the DIF

findings as a function of gender, ethnicity, and age. It was hypothesized that item-

level responses would change over time as a whole, and as a function of group

membership according to gender and ethnic group. This hypothesis is

supported by the DIF findings. At the same time, several issues emerged that point to

limitations of this study and suggest recommendations for future research.

In mentioning the overlap of the test characteristic curves in all group

comparisons, the multifaceted nature of academic self-concept, based on which the

SDQ was constructed, needs to be revisited. This dissertation concludes that, although

item-level DIF emerged as a function of gender, ethnicity, and age, the scale scores

can be trusted because the DIF tends to cancel out. Results of factor analysis, however,

indicate a loose one-factor structure of the scale. While multiple-factor solutions were



not tested in the current study due to the lack of empirical basis, it is possible that a 2-

factor (i.e., cognitive vs. affective) structure is valid in some subpopulations. That is,

perceived competence may correlate so weakly with academic interest in such a

subpopulation that the cognitive items load on a separate factor instead. If this is the

case, then DIF of the SDQI will need to be examined at the subscale level. Given a

cognitive versus affective distinction, for example, DIF analysis will need to be

conducted on the cognitive and affective items separately. Such results cannot be

predicted from the current investigation because the latent construct will be different.

Future research may consider adopting growth mixture modeling to better account for

the weak unidimensionality before any DIF analysis.

From a developmental perspective, future research may also examine the

respondent-item interaction among school-age children and adolescents. Findings of the

current study seem to suggest that a 4-point scale still presents too many response

options for elementary school children, particularly younger ones. In particular, the

item thresholds appear to be so close to each other that the scale functions more like a

dichotomous item (i.e., 1 to 3-false, 4-true). This finding is consistent with

previous research showing that younger children tend to respond in an extreme manner to

Likert-type rating scales (Chambers & Craig, 1998; Goodenough et al., 1997; von

Baeyer, Carlson, & Webb, 1997). Underlying this is the Piagetian theory that young

children characteristically engage in dichotomous thinking and may therefore focus on

either extreme of the response options (i.e., not at all true and very true). Interestingly,

younger children (e.g., 7- to 9-year-olds as compared with 10- to 12-year-olds) were

found to engage in such dichotomous thinking particularly when rating emotion-based

statements (Chambers & Johnston, 2002). In the same study, it was also reported that

children’s extreme scores were not a function of the number of options: children

who used the 3-point options had extreme scores similar to those of children who used the

5-point ones. As such, a dichotomous option (i.e., true or false) may be the most

appropriate and economical response format for the current SDQI items, particularly

the affective, emotion-based ones, because children are still likely to endorse extreme



scores when the scale is tailored for 3-point options. A limitation to note here is that

the aforementioned dichotomous thinking of young children does not account for the

limited and unbalanced coverage of the θ continuum of the SDQI items from the IRT

perspective. In addition, the current investigation has also observed a trend for the

thresholds to expand over the θ continuum as children grow older, but more data over

multiple occasions are needed to test this trend.
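
The earlier point in this paragraph, that closely spaced thresholds make a 4-point item behave much like a dichotomous one, can be sketched numerically. The example assumes a graded response model and uses hypothetical parameter values; it compares the highest probability that either middle option ever attains across the θ continuum when thresholds are tightly clustered versus well spread.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities under a graded response model (assumed here)."""
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    # Cumulative curves P(X >= k | theta) for k = 1..K-1
    cum = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    upper = np.hstack([np.ones((theta.size, 1)), cum])
    lower = np.hstack([cum, np.zeros((theta.size, 1))])
    return upper - lower

theta = np.linspace(-4, 4, 401)
a = 2.0  # hypothetical discrimination

# Hypothetical threshold sets: tightly clustered versus well spread
close = grm_category_probs(theta, a, [-0.2, 0.0, 0.2])
spread = grm_category_probs(theta, a, [-1.5, 0.0, 1.5])

# Highest probability either middle option (1 or 2) reaches at any theta
peak_middle_close = close[:, 1:3].max()
peak_middle_spread = spread[:, 1:3].max()
print(f"peak middle-option probability, close thresholds: {peak_middle_close:.2f}")
print(f"peak middle-option probability, spread thresholds: {peak_middle_spread:.2f}")
```

When the thresholds are clustered, the middle options are rarely the likely response at any interest level, so nearly all responses fall on the two extremes.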

Relatedly, cultural influences must not be neglected, especially since the

original SDQI was normed in Australian samples (Marsh, 1992a). That the average

item thresholds are very low among U.S. children may also be a function of the

culture in which individuals are encouraged to express their feelings and thoughts in a

positive way. While this notion is preliminary and certainly requires further

exploration, cultural differences in how people experience, regulate, and express their

emotions are actually well-documented in the literature (e.g., Goetz, Spencer-Rodgers,

& Peng, 2008; Scollon, Diener, Oishi, & Biswas-Diener, 2004, 2005). In particular,

research has shown that different cultures socialize children to regulate their emotions

differently. For example, U.S. mothers are more likely to up-regulate children’s

positive emotions by highlighting children’s success, while Chinese mothers are more

likely to down-regulate children’s positive emotions by not highlighting their success

(Miller, Wang, Sandel, & Cho, 2002; Ng, Pomerantz, & Lam, 2007). Such findings

may help us understand why U.S. children tend to endorse the high extreme of the

options, and further, why White children are more likely to represent their academic

interest by highlighting their competence and success. To raise the average item

thresholds of the SDQI among U.S. children, items may be revised to present more

challenges and elicit more reflection from children. For instance, “I like math” may be

revised to “I like math even when I cannot work out the problems.” Of course, such

revisions may lead to other issues such as double-barreled items and low readability

for young children. Future research must carefully examine this trade-off in such

measures.



In terms of content relevance of the items, future measures of interest may also

consider including other aspects discussed in the interest literature such as curiosity,

attitude, and attention. Among these aspects of interest, educational research needs to

pay special attention to the role of attitude/value in developing and maintaining

academic interest. In general, attitude or value refers to an individual’s evaluation or

appraisal of the activity (Ajzen, 1988, 1991; Wigfield & Eccles, 2000, 2002). The

social cognitive perspective postulates that a learner’s motivation is jointly influenced

by his self-efficacy/perceived competence and his outcome expectancy/attitude/value,

and it was theorized earlier in this dissertation (see Chapter 2) that the social cognitive

perspective may well apply to the approaches to interest research. The current SDQI

items seem to address the affective and perceived competence aspects of interest well,

but largely neglect the role of value in defining interest.

According to Renninger and Hidi (2002), value plays an important role in

developing one’s unstable situational interest into personal interest, a more well-

developed type of interest which helps maintain and deepen the student’s motivation.

To be more specific, it requires certain levels of stored prior knowledge and positive

value within an individual for his situational interest to evolve into personal interest.

As such, the assessment of one’s positive value associated with the task may be a good

indicator of how well-developed and stable his academic interest is, and it is

recommended that future instruments of interest also include the measure of value. It

is worth noting that the construct of value is termed and operationalized differently in

various theories, and test developers must carefully select or design value items which

best serve the measure of interest.

A final limitation of this study is that it does not address the

distinction between personal interest and situational interest. The SDQI items were not

developed based on this theoretical classification, so the affective statements are

generally straightforward with little context information to be identified as

representing personal interest or situational interest. The IRT-based evaluation was

expected to yield results that might help distinguish situational interest from personal



interest, but it turned out that all affective items demonstrated similar psychometric

properties. In addition to the inclusion of value in defining interest, future measures of

interest may also address the personal-situational distinction of interest. Very few

studies have been conducted with regard to this distinction, but the systematic review

of situational interest by Schraw and Lehman (2001) may offer some guidelines for

future assessments of situational interest. They concluded that one could distinguish

among different types of situational interest, and organized empirical research on

interest into three main categories of situational interest: text-based, task-based, and

knowledge-based. In Mitchell’s (1993) interest survey, for example, two general areas

of interest (i.e., personal and situational) and five specific components of situational

interest (i.e., meaningfulness, involvement, computers, groups, and puzzles) are

assessed to present the multifaceted structure of interest. This survey includes items

such as “mathematics is enjoyable to me” and “I have always enjoyed studying

mathematics in school” to assess high school students’ personal interest, and items

such as “our class is fun” and “this year I like math” to assess their situational interest.

As compared with these items, the SDQI ones seem to mostly assess personal interest

due to a lack of context- or time-specificity.

In summary, findings of this study provide several directions for future

research on academic interest and affect, particularly for instrument development and

validation in the field. To develop a good self-report measure of children’s academic

interest, test developers will need to go through a list of considerations including

children’s cognitive abilities, their special interactions with emotion-based items, the

cultural scripts of the society and certain subpopulations, the wording and content

relevance of the items, and the personal-situational distinction and multifaceted

structure of interest.



BIBLIOGRAPHY

Ai, X. (2002). Gender differences in growth in mathematics achievement: Three-level longitudinal and multilevel analyses of individual, home, and school influences. Mathematical Thinking and Learning, 4(1), 1-22.

Ajzen, I. (1988). Attitudes, personality, and behavior. Milton Keynes, England: Open University Press.

Ajzen, I. (1991). The theory of planned behavior. Organizational Behavior and Human Decision Processes, 50, 179-211.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Bainbridge, W. L., & Lasley, T. J. (2002). Demographics, diversity, and K-12 accountability: The challenge of closing the achievement gap. Education and Urban Society, 34(4), 422-437.

Baker, F. B. (1985). The basics of item response theory. Portsmouth, NH: Heinemann.

Baker, F. B. (2001). The basics of item response theory (2nd ed.). College Park, MD: ERIC Clearinghouse on Assessment and Evaluation, University of Maryland.

Bandura, A. (1977). Social learning theory. Englewood Cliffs, NJ: Prentice-Hall.

Bandura, A. (1986). Social foundations of thought and action: A social cognitive theory. Englewood Cliffs, NJ: Prentice Hall.

Bandura, A. (1997). Self-efficacy: The exercise of control. New York, NY: Freeman.

Barak, A. (1981). Vocational interests. Journal of Vocational Behavior, 19, 1-14.

Bartlett, F. C. (1932). Remembering: A study in experimental and social psychology. New York, NY: Cambridge University Press.

Bhana, D. (2005). “I’m the best in maths. Boys rule, girls drool.” Masculinities, mathematics, and primary schooling. Perspectives in Education, 23, 1-10.

Blechman, E. A. (1990). Moods, affect, and emotions. Hillsdale, NJ: Lawrence Erlbaum Associates.



Bolt, D. M. (2002). A Monte Carlo comparison of parametric and nonparametric polytomous DIF detection methods. Applied Measurement in Education, 15(2), 113-141.

Bong, M., & Clark, R. E. (1999). Comparison between self-concept and self-efficacy in academic motivation research. Educational Psychologist, 34, 139-154.

Bong, M., & Skaalvik, E. M. (2003). Academic self-concept and self-efficacy: How different are they really? Educational Psychology Review, 15(1), 1-40.

Brown, T. A. (2006). Confirmatory factor analysis for applied research. New York, NY: Guilford.

Bussey, K., & Bandura, A. (1992). Self-regulatory mechanisms governing gender development. Child Development, 63, 1236-1250.

Bussey, K., & Bandura, A. (2004). Social cognitive theory of gender development and differentiation. In A. H. Eagly, A. E. Beall, & R. J. Sternberg (Eds.), The psychology of gender (2nd ed., pp. 92-119). New York, NY: Guilford Press.

Byrnes, J. P. (2003). Factors predictive of mathematics achievement in White, Black, and Hispanic 12th graders. Journal of Educational Psychology, 95(2), 316-326.

Cai, L. (2008). SEM of another flavour: Two new applications of the supplemented EM algorithm. British Journal of Mathematical and Statistical Psychology, 61, 309-329.

Cai, L. (2012). flexMIRT: Flexible multilevel item factor analysis and test scoring [Computer Software]. Seattle, WA: Vector Psychometric Group, LLC.

Cai, L., Thissen, D., & du Toit, S. H. (2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling [Computer software]. Lincolnwood, IL: Scientific Software International.

Casey, M. B., Nuttall, R. L., & Pezaris, E. (1997). Mediators of gender differences in mathematics college entrance test scores: A comparison of spatial skills with internalized beliefs and anxieties. Developmental Psychology, 33, 669-680.

Catsambis, S. (1994). The path to math: Gender and racial-ethnic differences in mathematics participation from middle school to high school. Sociology of Education, 67, 199-215.

Chambers, C. T., & Craig, K. D. (1998). An intrusive impact of anchors in children’s faces pain scales. Pain, 78, 27-37.



Chambers, C. T., & Johnston, C. (2002). Developmental differences in children’s use of rating scales. Journal of Pediatric Psychology, 27(1), 27-36.

Church, A. T. (2001). Personality measurement in cross-cultural perspective. Journal of Personality, 69(6), 979-1006.

Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York, NY: Chapman & Hall.

de Ayala, R. J. (2009). The theory and practice of item response theory. New York, NY: The Guilford Press.

DeCuir-Gunby, J. T., Aultman, L. P., & Schutz, P. A. (2009). Investigating transactions among motives, emotional regulation related to testing, and test emotions. The Journal of Experimental Education, 77(4), 409-436.

DeJarnette, N. K. (2012). America's children: Providing early exposure to STEM (Science, Technology, Engineering and Math) initiatives. Education, 133(1), 77-84.

Denes-Raj, V., & Epstein, S. (1994). Conflict between intuitive and rational processing: When people behave against their better judgment. Journal of Personality and Social Psychology, 66(5), 819-829.

Dewey, J. (1913). Interest and effort in education. Boston, MA: Riverside Press.

Dweck, C. S. (1999). Self-theories: Their role in motivation. Philadelphia, PA: Taylor & Francis.

Dweck, C. S., & Elliott, E. S. (1983). Achievement motivation. In P. H. Mussen, & E. M. Heatherington (Eds.), Handbook of child psychology: Vol 4. Socialization, personality, and social development (pp. 643-691). New York, NY: Wiley.

Dweck, C. S., & Leggett, E. L. (1988). A social-cognitive approach to motivation and personality. Psychological Review, 95, 256-273.

Eccles, J. S., & Midgley, C. (1989). Stage/environment fit: Developmentally appropriate classrooms for adolescents. In R. Ames, & C. Ames (Eds.), Research on motivation in education (Vol. III, pp. 139-181). New York, NY: Academic Press.

Eccles, J. S., & Wigfield, A. (1995). In the mind of the actor: The structure of adolescents' achievement task values and expectancy-related beliefs. Personality and Social Psychology Bulletin, 21, 215-225.



Eccles, J. S., Adler, T. F., Futterman, R., Goff, S. B., Kaczala, C. M., & Meece, J. L. (1983). Expectations, values, and academic behaviors. In J. T. Spence (Ed.), Achievement and achievement motivation (pp. 75-146). San Francisco, CA: W. H. Freeman.

Eccles, J. S., Wigfield, A., & Schiefele, U. (1998). Motivation to succeed. In N. Eisenberg (Ed.), Social, emotional, and personality development in handbook of child psychology (Vol. III, pp. 1017-1096). New York, NY: Wiley.

Eccles, J. S., Wigfield, A., Flanagan, C. A., Miller, C., Reuman, D. A., & Yee, D. (1989). Self-concepts, domain values, and self-esteem: Relations and changes at early adolescence. Journal of Personality, 57(2), 283-310.

Edelen, M. O., Thissen, D., Teresi, J. A., Kleinman, M., & Ocepek-Welikson, K. (2006). Identification of differential item functioning using item response theory and the likelihood-based model comparison approach: Application to the Mini-Mental State Examination. Medical Care, 44(11), S134-S142.

Else-Quest, N. M., Hyde, J. S., & Hejmadi, A. (2008). Mother and child emotions during mathematics homework. Mathematical Thinking and Learning, 10, 5-35.

Else-Quest, N. M., Hyde, J. S., & Linn, M. C. (2010). Cross-national patterns of gender differences in mathematics: A meta-analysis. Psychological Bulletin, 136(1), 103-127.

Else-Quest, N. M., Mineo, C. C., & Higgins, A. (2013). Math and science attitudes and achievement at the intersection of gender and ethnicity. Psychology of Women Quarterly, 37(3), 293-309.

Epstein, S. (1990). Cognitive-experiential Self-theory. In L. Pervin (Ed.), Handbook of personality theory and research: Theory and research (pp. 165-192). New York, NY: Guilford Publications, Inc.

Evans, E. M., Schweingruber, H., & Stevenson, H. W. (2002). Gender differences in interest and knowledge acquisition: The United States, Taiwan, and Japan. Sex Roles, 37(3/4), 153-167.

Evans, K. M. (1971). Attitudes and interests in education (2nd ed.). London, UK: Routledge & Kegan.

Eysenck, M. E. (1982). Attention and arousal. New York, NY: Springer-Verlag.

Field, A. (2009). Discovering statistics using SPSS (3rd ed.). Thousand Oaks, CA: Sage Publications.



Flavell, J. (1987). Speculations about the nature and development of metacognition. In F. E. Weinert, & R. H. Kluwe (Eds.), Metacognition, motivation, and understanding (pp. 21-64). Hillsdale, NJ: Erlbaum.

Fleischman, H. L., Hopstock, P. J., Pelczar, M. P., & Shelley, B. E. (2010). Highlights from PISA 2009: Performance of U.S. 15-year old students in reading, mathematics, and science literacy in an international context (NCES 2011-004). Washington, DC: U.S. Government Printing Office.

Forgas, J. (2000). The role of affect in social cognition. In J. Forgas (Ed.), Feeling and thinking (pp. 1-28). New York, NY: Cambridge University Press.

Freud, S. (1961). Some psychical consequences of the anatomical distinction between the sexes. In J. Strachey (Ed.), The standard edition of the complete psychological works of Sigmund Freud (J. Strachey, Trans., Vol. 19, pp. 241-258). London, UK: Hogarth Press.

Frijda, N. (1994). Varieties of affect: Emotions and episodes, moods, and sentiments. In P. Ekman, & R. J. Davidson (Eds.), The nature of emotion (pp. 59-67). New York, NY: Oxford University Press.

Gall, M. D., Gall, J. P., & Borg, W. R. (2006). Educational research: An introduction (8th ed.). Boston, MA: Pearson.

Gay, L. R., Mills, G. E., & Airasian, P. (2008). Educational research: Competencies for analysis and applications (9th ed.). Upper Saddle River, NJ: Pearson.

General Accounting Office. (2005). Higher education: Federal science, technology, engineering, and mathematics programs and related trends (Publication No. GAO-06-114). Retrieved from U.S. Government Accountability Office website: http://www.gao.gov/products/GAO-06-702T.

Goetz, J. L., Spencer-Rodgers, J., & Peng, K. (2008). Dialectical emotions: How cultural epistemologies influence the experience and regulation of emotional complexity. In R. Sorrentino, & S. Yamaguchi (Eds.), Handbook of motivation and cognition across cultures (pp. 517-539). Amsterdam, The Netherlands: Elsevier.

Goldstein, H. (1983). Measuring changes in educational attainment over time: Problems and possibilities. Journal of Educational Measurement, 33, 315-332.

Goodenough, B., Kampel, L., Champion, G. D., Laubreaux, L., Nicholas, M. K., Ziegler, J. B., & McInerney, M. (1997). An investigation of the placebo effect and age-related factors in the report of needle pain from venepuncture in children. Pain, 72, 383-391.



Gorsuch, R. L. (1997). Exploratory factor analysis: Its role in item analysis. Journal of Personality Assessment, 68(3), 532-560.

Goyette, K., & Xie, Y. (1999). Educational expectations of Asian American youths: Determinants and ethnic differences. Sociology of Education, 72, 22-36.

Guiso, L., Monte, F., Sapienza, P., & Zingales, L. (2008). Culture, gender, and math. Science, 320, 1164-1165.

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Norwell, MA: Kluwer Academic Publishers.

Harter, S. (1981). A model of mastery motivation in children: Individual differences and developmental change. In W. A. Collins (Ed.), Aspects on the development of competence: The Minnesota symposia on child psychology (Vol. 14, pp. 215-255). Hillsdale, NJ: Erlbaum.

Harter, S. (1996). Teacher and classmate influences on scholastic motivation, self-esteem, and level of voice in adolescents. In J. Juvonen, & K. R. Wentzel (Eds.), Social motivation: Understanding children's school adjustment (pp. 11-42). Cambridge, England: Cambridge University Press.

Herbart, J. F. (1806). Allgemeine Pädagogik, aus dem Zweck der Erziehung abgeleitet. In J. F. Herbart (Ed.), Pädagogische schriften (Vol. II). Düsseldorf, Germany: Kupper.

Herrara, F. A., & Hurtado, S. (2011). Maintaining initial interests: Developing science, technology, engineering, and mathematics (STEM) career aspirations among underrepresented racial minority students. Retrieved from Higher Education Research Institute website: http://www.heri.ucla.edu/publications-main.php.

Herrera, A., & Gomez, J. (2008). Influence of equal or unequal comparison group sample sizes on the detection of differential item functioning using the Mantel-Haenszel and logistic regression techniques. Quality and Quantity, 42(6), 739-755.

Hidi, S. (1990). Interest and its contribution as a mental resource for learning. Review of Educational Research, 60, 323-350.

Hidi, S. (2000). An interest researcher's perspective: The effects of extrinsic and intrinsic factors on motivation. In C. Sansone, & J. Harackiewicz (Eds.), Intrinsic and extrinsic motivation (pp. 309-339). San Diego, CA: Academic Press.



Hidi, S., & Anderson, V. (1992). Situational interest and its impact on reading and expository writing. In K. A. Renninger, S. Hidi, & A. Krapp (Eds.), The role of interest in learning and development (pp. 215-238). Hillsdale, NJ: Erlbaum.

Hidi, S., & Harackiewicz, J. (2000). Motivating the academically unmotivated: A critical issue for the 21st century. Review of Educational Research, 70, 151-179.

Hidi, S., Baird, W., & Hildyard, A. (1982). That's important but is it interesting? Two factors in text processing. In A. Flammer, & W. Kintsch (Eds.), Discourse processing (pp. 63-75). Amsterdam, The Netherlands: North-Holland.

Hoffmann, L. (2002). Promoting girls' interest and achievement in physics classes for beginners. Learning and Instruction, 12(4), 447-465.

Holland, J. L. (1985). Making vocational choices: A theory of vocational personalities and work environments (2nd ed.). Englewood Cliffs, NJ: Prentice Hall.

Holland, P. W., & Thayer, D. T. (1988). Differential item functioning and the Mantel-Haenszel procedure. In H. Wainer, & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Erlbaum.

Holland, P. W., & Wainer, H. (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum Associates.

Hossian, M. M., & Robinson, M. G. (2012). How to motivate US students to pursue STEM (Science, Technology, Engineering and Mathematics) careers. US-China Education Review A, 4, 442-451.

Houts, C. R., & Cai, L. (2013). flexMIRT(R) user's manual version 2: Flexible multilevel multidimensional item analysis and test scoring. Chapel Hill, NC: Vector Psychometric Group.

Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1-55.

Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 221-233.

Hutcheson, G., & Sofroniou, N. (1999). The multivariate social scientist. London, UK: Sage.


Hyde, J. S., Fennema, E., Ryan, M., Frost, L. A., & Hopp, C. (1990). Gender comparisons of mathematics attitudes and affect. Psychology of Women Quarterly, 14, 299-324.

Hyde, J. S., Lindberg, S. M., Linn, M. C., Ellis, A., & Williams, C. (2008). Gender similarities characterize math performance. Science, 321, 494-495.

Iran-Nejad, A. (1987). Cognitive and affective causes of interest and liking. Journal of Educational Psychology, 79(2), 120-130.

Izard, C. E. (1977). Human emotions. New York, NY: Plenum Press.

Izard, C. E., & Ackerman, B. P. (2000). Motivational, organizational, and regulatory functions of discrete emotions. In M. Lewis, & J. Haviland-Jones (Eds.), Handbook of emotions (2nd ed., pp. 253-264). New York, NY: Guilford Press.

Jacobs, J. E., & Simpkins, S. D. (2005). Mapping the leaks in the math, science, and technology pipeline. New Directions for Child and Adolescent Development, 110, 3-6.

Kahle, J., Parker, L., Rennie, L., & Riley, D. (1993). Gender differences in science education: Building a model. Educational Psychologist, 28, 379-404.

Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20, 141-151.

Kao, G., & Thompson, J. (2003). Racial and ethnic stratification in educational achievement and attainment. Annual Review of Sociology, 29, 417-442.

Kaplan, A., & Midgley, C. (1999). The relationship between perceptions of the classroom goal structure and early adolescents’ affect in school: The mediating role of coping strategies. Learning and Individual Differences, 11, 187-212.

Kelley, H. H., & Michela, J. (1980). Attribution theory and research. Annual Review of Psychology, 31, 457-501.

Kim, J., & Oshima, T. C. (2013). Effect of multiple testing adjustment in differential item functioning detection. Educational and Psychological Measurement, 73, 458-470.

Klassen, R. M., & Usher, E. L. (2010). Self-efficacy in educational settings: Recent research and emerging directions. In T. C. Urdan, & S. A. Karabenick (Eds.), Advances in motivation and achievement: Vol. 16A. The decade ahead: Theoretical perspectives on motivation and achievement (pp. 1-33). Bingley, UK: Emerald Publishing Group.


Kline, P. (2000). The handbook of psychological testing (2nd ed.). New York, NY: Routledge.

Köller, O., Baumert, J., & Schnabel, K. (2001). Does interest matter? The relationship between academic interest and achievement in mathematics. Journal for Research in Mathematics Education, 32(5), 448-470.

Korpershoek, H., Kuyper, H., Bosker, R., & van der Werf, G. (2013). Students leaving the STEM pipeline: An investigation of their attitudes and the influence of significant others on their study choice. Research Papers in Education, 28(4), 483-505.

Krapp, A. (1999). Interest, motivation, and learning: An educational-psychological perspective. Learning and Instruction, 14(1), 23-40.

Krapp, A., Hidi, S., & Renninger, K. A. (1992). Interest, learning, and development. In K. A. Renninger, S. Hidi, & A. Krapp (Eds.), The role of interest in learning and development (pp. 3-25). Hillsdale, NJ: Erlbaum.

Lacey, T. A., & Wright, B. (2009). Occupational employment projections to 2018. Monthly Labor Review, 132(11), 82-123.

Langer, M. (2008). A reexamination of Lord’s Wald test for differential item functioning using item response theory and modern error estimation (Doctoral dissertation). Retrieved from https://cdr.lib.unc.edu/

Larsen, R. J., & Fredrickson, B. L. (1999). Measurement issues in emotion research. In D. Kahneman, E. Diener, & N. Schwarz (Eds.), Well-being: Foundations of hedonic psychology (pp. 40-60). New York, NY: Russell Sage.

Leahey, E., & Guo, G. (2001). Gender differences in mathematical trajectories. Social Forces, 80(2), 713-732.

Lee, V. E., & Burkam, D. T. (2003). Inequality at the starting gate: Social background differences in achievement as children begin school. Washington, DC: Economic Policy Institute.

Lepper, M. R., Greene, D., & Nisbett, R. E. (1973). Undermining children's intrinsic interest with extrinsic reward: A test of the "overjustification" hypothesis. Journal of Personality and Social Psychology, 28, 129-137.

Lipsitz, S., & Fitzmaurice, G. (2009). Generalized estimating equations for longitudinal data analysis. In G. Fitzmaurice, M. Davidian, G. Verbeke, & G. Molenberghs (Eds.), Longitudinal data analysis (pp. 43-78). Boca Raton, FL: Chapman and Hall/CRC.


Locke, E. A., & Latham, G. P. (1990). A theory of goal setting and task performance. Englewood Cliffs, NJ: Prentice Hall.

Lord, F. M. (1977). A study of item bias, using item characteristic curve theory. In Y. H. Poortinga (Ed.), Basic problems in cross-cultural psychology (pp. 19-29). Amsterdam, The Netherlands: Swets and Zeitlinger.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.

Marsh, H. W. (1992a). Self Description Questionnaire (SDQ) I: A theoretical and empirical basis for the measurement of multiple dimensions of preadolescent self-concept. An interim test manual and research monograph. Macarthur, New South Wales, Australia: University of Western Sydney, Faculty of Education.

Marsh, H. W. (1992b). Self Description Questionnaire (SDQ) II: A theoretical and empirical basis for the measurement of multiple dimensions of adolescent self-concept. A test manual and research monograph. Macarthur, New South Wales, Australia: University of Western Sydney, Faculty of Education.

Marsh, H. W. (1992c). Self Description Questionnaire (SDQ) III: A theoretical and empirical basis for the measurement of multiple dimensions of late adolescent self-concept. An interim test manual and research monograph. Macarthur, New South Wales, Australia: University of Western Sydney, Faculty of Education.

Marsh, H. W., & O'Neill, R. (1984). Self Description Questionnaire III: The construct validity of multidimensional self-concept ratings by late adolescents. Journal of Educational Measurement, 21(2), 153-174.

Marsh, H. W., Hau, K. T., & Wen, Z. (2004). In search of golden rules: Comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler's (1999) findings. Structural Equation Modeling, 11, 320-341.

Marsh, H. W., Relich, J. D., & Smith, I. D. (1983). Self-concept: The construct validity of interpretations based upon the SDQ. Journal of Personality and Social Psychology, 45, 173-187.

Marsh, H. W., Smith, I. D., & Barnes, J. (1983). Multitrait-multimethod analysis of interpretations based upon the SDQ: Student-teacher agreement on multidimensional ratings of student self-concept. American Educational Research Journal, 20, 333-357.


Marsh, H. W., Trautwein, U., Lüdtke, O., Köller, O., & Baumert, J. (2005). Academic self-concept, interest, grades, and standardized test scores: Reciprocal effects models of causal ordering. Child Development, 76, 397-416.

Martinez, S., & Guzman, S. (2013). Gender and racial/ethnic differences in self-reported levels of engagement in high school math and science courses. Hispanic Journal of Behavioral Sciences, 35(3), 407-427.

McGraw, R., Lubienski, S. T., & Strutchens, M. E. (2006). A closer look at gender in NAEP mathematics achievement and affect data: Intersections with achievement, race/ethnicity, and socioeconomic status. Journal for Research in Mathematics Education, 37, 129-150.

Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics, 7, 105-118.

Miller, P., Wang, S., Sandel, T., & Cho, G. (2002). Self-esteem as folk theory: A comparison of European American and Taiwanese mothers’ beliefs. Parenting: Science and Practice, 2, 209-239.

Mitchell, M. (1993). Situational interest: Its multifaceted structure in the secondary school mathematics classroom. Journal of Educational Psychology, 85(3), 424-436.

Muthén, L. K., & Muthén, B. O. (2010). Mplus user's guide (6th ed.). Los Angeles, CA: Muthén & Muthén.

National Academy of Sciences, National Academy of Engineering, & Institute of Medicine. (2007). Rising above the gathering storm: Energizing and employing America for a brighter economic future. Washington, DC: The National Academies Press.

National Research Council. (2011). Successful K-12 STEM education: Identifying effective approaches in science, technology, engineering, and mathematics. Washington, DC: The National Academies Press.

National Science Board. (2010). Science and engineering indicators. Arlington, VA: National Science Foundation.

National Science Foundation, National Center for Science and Engineering Statistics. (2010). Survey of doctorate recipients. Retrieved from http://www.nsf.gov/statistics/wmpd/2013/pdf/tab9-22.pdf

National Science Foundation, National Center for Science and Engineering Statistics. (2013). Women, minorities, and persons with disabilities in science and engineering: 2013 (Special Report NSF 13-304). Retrieved from http://www.nsf.gov/statistics/wmpd/2013/pdf/nsf13304_digest.pdf

Ng, F., Pomerantz, E., & Lam, S. (2007). European American and Chinese parents’ responses to children’s success and failure: Implications for children’s responses. Developmental Psychology, 43, 1239-1255.

Oakes, J. (1990). Opportunities, achievement, and choice: Women and minority students in science and mathematics. Review of Research in Education, 16, 153-222.

Ogbu, J. (2003). Black American students in an affluent suburb: A study of academic disengagement. Mahwah, NJ: Erlbaum.

Organization for Economic Cooperation and Development. (2003). Education at a glance. Paris, France: OECD.

Pekrun, R. (1992). The impact of emotions on learning and achievement: Toward a theory of cognitive/motivational mediators. Applied Psychology: An International Review, 41(4), 359-376.

Pekrun, R., Elliot, A. J., & Maier, M. A. (2006). Achievement goals and discrete achievement emotions: A theoretical model and prospective test. Journal of Educational Psychology, 98(3), 583-597.

Pekrun, R., Elliot, A. J., & Maier, M. A. (2009). Achievement goals and achievement emotions: Testing a model of their joint relations with academic performance. Journal of Educational Psychology, 101(1), 115-135.

Pekrun, R., Goetz, T., Titz, W., & Perry, R. P. (2002). Academic emotions in students’ self-regulated learning and achievement: A program of qualitative and quantitative research. Educational Psychologist, 37(2), 91-105.

Piaget, J. (1985). The equilibration of cognitive structures. Chicago, IL: University of Chicago Press.

Potenza, M., & Dorans, N. J. (1995). DIF assessment of polytomously scored items: A framework for classification and evaluation. Applied Psychological Measurement, 19(1), 23-37.

President's Council of Advisors on Science and Technology. (2010). Prepare and inspire: K-12 education in STEM (science, technology, engineering and math) for America’s future. Retrieved from http://www.whitehouse.gov/sites/default/files/microsites/ostp/pcast-stemed-report.pdf


Provasnik, S., Kastberg, D., Ferraro, D., Lemanski, N., Roey, S., & Jenkins, F. (2012). Highlights from TIMSS 2011: Mathematics and science achievement of U.S. fourth- and eighth-grade students in an international context (NCES 2013-009). Washington, DC: National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education.

Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495-502.

Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197-207.

Ramirez, M., Teresi, J. A., Holmes, D., Gurland, B., & Lantigua, R. (2006). Differential item functioning (DIF) and the Mini-Mental State Examination (MMSE): Overview, sample, and issues of translation. Medical Care, 44(Suppl 3), S95-S106.

Reise, S. P., & Yu, J. (1990). Parameter recovery in the graded response model using MULTILOG. Journal of Educational Measurement, 27, 133-144.

Renninger, K. A. (1984). Object-child relations: Implications for both learning and teaching. Children's Environment Quarterly, 1, 3-6.

Renninger, K. A. (1989). Individual patterns in children's play interests. In L. T. Winegar (Ed.), Social interaction and the development of children's understanding (pp. 147-172). Norwood, NJ: Ablex.

Renninger, K. A. (1990). Children's play interests, representation, and activity. In R. Fivush, & J. Hudson (Eds.), Knowing and remembering in young children (pp. 127-165). Cambridge, England: Cambridge University Press.

Renninger, K. A. (1992). Individual interest and development: Implications for theory and practice. In K. A. Renninger, S. Hidi, & A. Krapp (Eds.), The role of interest in learning and development (pp. 361-395). Hillsdale, NJ: Erlbaum.

Renninger, K. A., & Hidi, S. (2002). Student interest and achievement: Developmental issues raised by a case study. In A. Wigfield, & J. S. Eccles (Eds.), Development of achievement motivation (pp. 173-195). New York, NY: Academic Press.

Riegle-Crumb, C. (2005). The cross-national context of the gender gap in math and science. In L. Hedges, & B. Schneider (Eds.), The social organization of schooling (pp. 227-243). New York, NY: Russell Sage Foundation.


Riegle-Crumb, C., & Grodsky, E. (2010). Racial-ethnic differences at the intersection of math course-taking and achievement. Sociology of Education, 83(3), 248-270.

Riegle-Crumb, C., Moore, C., & Ramos-Wada, A. (2011). Who wants to have a career in science or math? Exploring adolescents’ future aspirations by gender and race/ethnicity. Science Education, 95(3), 458-476.

Rivers, C., & Barnett, R. C. (2011). The truth about boys and girls: Challenging toxic stereotypes about our children. New York, NY: Columbia University Press.

Rogosa, D. R. (1995). Myths and methods: “Myths about longitudinal research” plus supplemental questions. In J. M. Gottman (Ed.), The analysis of change (pp. 3-65). Hillsdale, NJ: Erlbaum.

Russell, S. H., Hancock, M. P., & McCullough, J. (2007). Benefits of undergraduate research experiences. Science, 316, 548-549.

Samejima, F. (1969). Calibration of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 19, 86-100.

Samejima, F. (1974). Normal ogive model on the continuous response level in the multidimensional latent space. Psychometrika, 39, 111-121.

Sanders, T. (2004, October). No time to waste: The vital role of college and university leaders in improving science and mathematics education. Paper presented at the Invitational Conference on Teacher Preparation and Institutions of Higher Education, Washington, DC.

Scheirer, M. A., & Kraut, R. E. (1979). Increasing educational achievement via self concept. Review of Educational Research, 49(1), 131-150.

Schiefele, U. (1991). Interest, learning, and motivation. Educational Psychologist, 26, 299-323.

Schiefele, U., Krapp, A., & Winteler, A. (1992). Interest as a predictor of academic achievement: A meta-analysis of research. In K. A. Renninger, S. Hidi, & A. Krapp (Eds.), The role of interest in learning and development (pp. 183-212). Hillsdale, NJ: Erlbaum.

Schiefele, U., Winteler, A., & Krapp, A. (1988). Studieninteresse und fachbezogene Wissensstruktur. Psychologie in Erziehung und Unterricht, 35, 106-118.

Schmidt, W. H. (2011, May). STEM reform: Which way to go? Paper presented at the National Research Council Workshop on Successful STEM Education in K-12 Schools. Retrieved from http://sites.nationalacademies.org/dbasse/bose/dbasse_080128#.UgEMEFPkDDn

Schmitt, T. A. (2011). Current methodological considerations in exploratory and confirmatory factor analysis. Journal of Psychoeducational Assessment, 29(4), 304-321.

Schraw, G., & Lehman, S. (2001). Situational interest: A review of the literature and directions for future research. Educational Psychology Review, 13(1), 23-52.

Schunk, D. H. (1995). Self-efficacy and education and instruction. In J. E. Maddux (Ed.), Self-efficacy, adaptation, and adjustment (pp. 281-303). New York, NY: Plenum Press.

Schunk, D. H., Pintrich, P. R., & Meece, J. L. (2008). Motivation in education (3rd ed.). Columbus, OH: Merrill.

Scollon, C. N., Diener, E., Oishi, S., & Biswas-Diener, R. (2004). Emotions across cultures and methods. Journal of Cross-Cultural Psychology, 35, 304-326.

Scollon, C. N., Diener, E., Oishi, S., & Biswas-Diener, R. (2005). An experience sampling and cross-cultural investigation of the relation between pleasant and unpleasant emotion. Cognition and Emotion, 19, 27-52.

Seifert, T. L. (1995). Academic goals and emotions: A test of two models. The Journal of Psychology, 129(5), 543-552.

Shavelson, R. J., Hubner, J. J., & Stanton, G. C. (1976). Self-concept: Validation and construct interpretations. Review of Educational Research, 46, 407-441.

Shavelson, R., & Bolus, R. (1981). Self-concept: The interplay of theory and methods. Journal of Educational Psychology, 74(1), 3-17.

Silvia, P. J. (2005). What is interesting? Exploring the appraisal structure of interest. Emotion, 5(1), 89-102.

Silvia, P. J. (2008). Interest—The curious emotion. Current Directions in Psychological Science, 17(1), 57-60.

Simpkins, S. D., & Davis-Kean, P. E. (2005). The intersection between self-concepts and values: Links between beliefs and choices in high school. New Directions for Child and Adolescent Development, 110, 31-47.


Skaalvik, E. M. (1997). Issues in research on self-concept. In M. Maehr, & P. R. Pintrich (Eds.), Advances in motivation and achievement (Vol. 10, pp. 51-97). New York, NY: JAI Press.

Skaalvik, E. M., & Rankin, R. J. (1996). Studies of academic self-concept using a Norwegian modification of the SDQ. Paper presented at the XXVI International Congress of Psychology, Montreal, Canada.

Skaalvik, S., & Skaalvik, E. M. (2004). Gender differences in math and verbal self-concept, performance expectations, and motivation. Sex Roles, 50(3/4), 241-252.

Skinner, B. F. (1953). Science and human behavior. New York, NY: Free Press.

Sörbom, D. (1989). Model modification. Psychometrika, 54, 371-384.

Spelke, E. S. (2005). Sex differences in intrinsic aptitude for mathematics and science. American Psychologist, 60(9), 950-958.

Steele, C., & Aronson, J. (1995). Stereotype threat and the intellectual performance of African Americans. Journal of Personality and Social Psychology, 69(5), 797-811.

Super, D. E. (1990). A life-span, life-space approach to career development. In D. Brown, L. Brooks, & Associates (Eds.), Career choice and development (pp. 197-261). San Francisco, CA: Jossey-Bass.

Swaminathan, H., & Rogers, J. A. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361-370.

Tanzer, N. K. (1996). Interest and competence as components of academic self-concepts. Paper presented at the XXVI International Congress of Psychology, Montreal, Canada.

Teresi, J. A. (2006). Overview of quantitative measurement methods: Equivalence, invariance, and differential item functioning in health applications. Medical Care, 44(11), S39-S49.

Thissen, D. J., Steinberg, L., & Wainer, H. (1988). Use of item response theory in the study of group differences in trace lines. In H. Wainer, & H. I. Braun (Eds.), Test validity (pp. 147-169). Hillsdale, NJ: Erlbaum.

Thissen, D. J., Steinberg, L., & Wainer, H. (1993). Detection of differential item functioning using the parameters of item response models. In P. W. Holland, & H. Wainer (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Erlbaum.

Thorndike, E. L. (1935). The fundamentals of learning. New York, NY: Teachers College Press.

Tourangeau, K., Nord, C., Lê, T., Sorongon, A. G., & Najarian, M. (2009). Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K), combined user's manual for the ECLS-K eighth-grade and K-8 full sample data files and electronic codebooks (NCES 2009-004). Washington, DC: National Center for Education Statistics, Institute of Education Sciences.

Tracey, T. J. (2002). Development of interests and competency beliefs: A 1-year longitudinal study of fifth- to eighth-grade students using the ICA-R and structural equation modeling. Journal of Counseling Psychology, 49, 148-163.

Tracey, T. J., & Ward, C. C. (1998). The structure of children's interests and competence perceptions. Journal of Counseling Psychology, 45, 290-303.

Tsui, M. (2007). Gender and mathematics achievement in China and the United States. Gender Issues, 24, 1-11.

U.S. Department of Education, National Center for Education Statistics. (2010). Early Childhood Longitudinal Study, Kindergarten Class of 1998-99 (ECLS-K) kindergarten through fifth grade Approaches to Learning and Self-Description Questionnaire (SDQ) items and public-use data files (NCES 2010-070) [Data file]. Washington, DC: Author.

Valsiner, J. (1992). Interest: A metatheoretical perspective. In K. A. Renninger, S. Hidi, & A. Krapp (Eds.), The role of interest in learning and development (pp. 27-41). Hillsdale, NJ: Erlbaum.

van Langen, A., & Dekkers, H. (2005). Cross-national differences in participating in tertiary science, technology, engineering, and mathematics education. Comparative Education, 41(3), 329-350.

Vandenberg, R. J. (2002). Towards a further understanding of an improvement in measurement invariance methods and procedures. Organizational Research Methods, 5(2), 139-158.

von Baeyer, C., Carlson, G., & Webb, L. (1997). Underprediction of pain in children undergoing ear piercing. Behaviour Research and Therapy, 35, 399-404.

Vygotsky, L. S. (1978). Mind in society: The development of higher psychological processes. Cambridge, MA: Harvard University Press.


Watt, H. M., Shapka, J. D., Morris, Z. A., Durik, A. M., Keating, D. P., & Eccles, J. S. (2012). Gendered motivational processes affecting high school mathematics participation, educational aspirations, and career plans: A comparison of samples from Australia, Canada, and the United States. Developmental Psychology, 48(6), 1594-1611.

Weiner, B. (1986). An attributional theory of motivation and emotion. New York, NY: Springer-Verlag.

Weiner, B. (1992). Human motivation: Metaphors, theories, and research. Newbury Park, CA: Sage.

Weiner, B. (1994). Integrating social and personal theories of achievement striving. Review of Educational Research, 64(4), 557-573.

Weisberg, H. F. (2005). The total survey error approach: A guide to the new science of survey research. Chicago, IL: University of Chicago Press.

Wigfield, A. (1994). Expectancy-value theory of achievement motivation: A developmental perspective. Educational Psychology Review, 6(1), 49-78.

Wigfield, A., & Eccles, J. S. (1992). The development of achievement task values: A theoretical analysis. Developmental Review, 12, 265-310.

Wigfield, A., & Eccles, J. S. (2000). Expectancy-value theory of achievement motivation. Contemporary Educational Psychology, 25(1), 68-81.

Wigfield, A., & Eccles, J. S. (2002). The development of achievement motivation. San Diego, CA: Academic Press.

Wigfield, A., Eccles, J. S., Schiefele, U., Roeser, R. W., & Davis-Kean, P. (2006). The development of achievement motivation. In N. Eisenberg (Ed.), Handbook of child psychology (6th ed., Vol. III). New York, NY: Wiley.

Woods, C. M. (2009). Empirical selection of anchors for tests of differential item functioning. Applied Psychological Measurement, 33(1), 42-57.

Woods, C. M., Cai, L., & Wang, M. (2012). The Langer-improved Wald test for DIF testing with multiple groups: Evaluation and comparison to two-group IRT. Educational and Psychological Measurement, 73(3), 532-547.

Xie, Y., & Shauman, K. A. (2003). Women in science: Career processes and outcomes. Cambridge, MA: Harvard University Press.

Yu, C. (2002). Evaluating cutoff criteria of model fit indices for latent variable models with binary and continuous outcomes (Doctoral dissertation, University of California, Los Angeles). Retrieved from http://statmodel2.com/download/Yudissertation.pdf

Zimmerman, B. J., & Schunk, D. H. (2003). Albert Bandura: The scholar and his contributions to educational psychology. In B. J. Zimmerman, & D. H. Schunk (Eds.), Educational psychology: A century of contributions (pp. 431-457). Mahwah, NJ: Lawrence Erlbaum Associates.
