
Assessment Matters 6 : 2014

Using Rasch measurement to improve analytical marking keys

Pina Tarricone and Martin G. Cooper

Abstract

This article describes the use of Rasch measurement to improve criteria-based analytical marking keys. The instruments and data described result from a research project which investigated the use and assessment of digital portfolios in the Western Australian senior secondary design course. The study involved two phases and two separate data sets. A criteria-based analytical marking key was used to score the portfolios in Phase 1, and a refined and improved version was used in Phase 2. Refinement of the criteria-based analytical marking key was undertaken as a result of Rasch analysis, and this is described and illustrated. Results showed that Rasch measurement can inform the improvement of marking keys in terms of criteria, targeting, number of score points, utilisation of score points, and item discrimination.

Background and purpose

During 2011 and 2012 the Centre for Schooling and Learning Technologies (CSaLT) at Edith Cowan University, in cooperation with the School Curriculum and Standards Authority (SCSA), undertook a multifaceted study which set out to investigate digital representations of student practical work in a number of Western Australian senior secondary courses for the purpose of online marking and summative assessment (Newhouse & Tarricone, in press).

This article reports on instruments and data from the senior secondary design course which was involved in the project. Students in this course were required to submit a paper portfolio of their design work, which was digitised for analytical marking. The study comprised two phases and two data sets, and marking of the portfolios was completed using a criteria-based analytical marking key which was analysed, refined, and improved between phases. Rasch analyses and the subsequent improvement of the Phase 1 and Phase 2 criteria-based analytical marking keys are presented and discussed in this article.



Literature review

The term analytical marking, also known as analytical scoring, refers to a scoring method in which marks are allocated to items in the piece to be assessed and then summed to obtain a measure of a student's performance. Analytical marking is the method of scoring preferred by assessment bodies for high-stakes summative assessment because it creates fine-grained measurement scales for university entrance and professional certification. This method is used because the resulting scores can be used to position each student relative to the overall performance of the population on a linear scale (Andrich, 2005; Tognolini, 2006).
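To make the mechanics concrete, the sketch below expresses analytical scoring as a range-checked sum of per-criterion marks. It is a minimal illustration only; the criterion names and maximum marks are hypothetical, not taken from the design course marking key.

    # Analytical scoring: allocate a mark per criterion, then sum.
    # Criterion names and maxima below are illustrative only.
    CRITERIA_MAX = {"ideas": 10, "structure": 8, "mechanics": 6}

    def total_score(marks):
        """Sum per-criterion marks after checking each is within range."""
        for criterion, mark in marks.items():
            if not 0 <= mark <= CRITERIA_MAX[criterion]:
                raise ValueError(f"{criterion}: mark {mark} out of range")
        return sum(marks.values())

    print(total_score({"ideas": 7, "structure": 5, "mechanics": 4}))  # 16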

Analytical marking methods have been critiqued in recent years by authors such as Pollitt (2004), who argue that this form of summative assessment is unlikely to produce an accurate measure. Despite these critiques, analytical marking, particularly using standards-referenced frameworks and rubrics, has remained popular (Andrich, 2005; Newhouse, 2013; Panadero & Jonsson, 2013; Tognolini, 2006). In Australia this may be partly because all states have a standards-referenced curriculum, meaning that educational outcomes must be clearly described with descriptors that lend themselves to being incorporated into rubric-based assessment structures (Andrich, 2005; Tognolini, 2006).

An important stage of analytical marking using scoring rubrics occurs before their actual use. Developing analytical marking keys that are fair, accurate, valid, and reliable can be challenging. Assessment developers need to be trained to design analytical marking keys that can produce reasonably reliable scores (Chi, 2001; Popham, 1997; Tierney & Simon, 2004). Also critical is the training of assessors in the application of the analytical marking key, to ensure a common understanding of the descriptors that have been employed (Newhouse, 2011). Jonsson and Svingby (2007), through a meta-analysis of 75 studies, concluded that "the reliable scoring of performance assessments can be enhanced by the use of rubrics. In relation to reliability issues, rubrics should be analytic, topic specific, and complemented with exemplars and/or rater training" (p. 414).



Scoring rubrics are composed of performance criteria and descriptors (Humphry & Heldsinger, 2009). Commonly, the effectiveness of scoring rubrics is judged by the correlation between the scores allocated by markers using the rubric. However, this merely shows the extent to which the markers rank students in a similar way and gives no information about the structure and measurement quality of the scoring rubric (Popham, 1997). There is therefore a need for tools that allow scoring rubrics to be examined more closely in terms of their categories/score points and targeting. Targeting in measurement refers to how well item difficulty matches the ability of the test takers (Wright & Stone, 1979). Generally, there is an awareness that criteria can be written differently and refined; it is less commonly understood how to improve the targeting of the instrument and adjust the number of score points/categories. Although there have been challenges to the use of criteria-based analytical assessment, some of the measurement and reliability concerns can be addressed through the use of Rasch modelling to create an interval scale on which the student scores lie. In this article, the term criteria-based analytical marking key is used because it provides a precise description of the structure of the scoring rubric.

Latent traits, including ability, attitudes, and beliefs, can be measured by applying the theory of additive conjoint measurement (Luce & Tukey, 1964). Latent traits cannot be directly observed, but observations of them can be made meaningful by the application of ordering. Rank ordering is the basis of the theory of conjoint measurement and can be applied to observations of ability and performance (Cavanagh & Waugh, 2011). It is the expert observations of the latent trait, in this case "student design capability", that are made in reference to a criteria-based analytical marking key. Design capability is described by Kimbell and Stables (2007) as the "ability to pursue a task with imagination and rigour and draw it to a resolution" (p. 18). It is important to note that this article does not further articulate and describe the latent trait, but focuses on the improvement of criteria-based analytical marking keys.


The quantification of observations is the basis of the analysis using the Rasch polytomous rating scale model (Andrich, 1978; Rasch, 1960). The Rasch measurement model predicts the magnitude of the latent trait, represented as items, for each person against the measurement scale, which is invariant (Cavanagh & Waugh, 2011). For example, in this study the Rasch model predicts the magnitude of the latent trait—student design capability—against the scores resulting from using the criteria-based analytical marking key (consisting of a number of items), which in this case is the invariant measurement scale. The model determines the probability of a particular response on an item on the basis of the difference between the item difficulty and the person ability (Andrich, 1988; Rasch, 1960).
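For reference, the polytomous rating scale model can be written out as follows. This is the standard formulation from the Rasch literature (Andrich, 1978), stated here for completeness rather than reproduced from the study's documentation:

\[
P(X_{ni} = x) = \frac{\exp\left(\sum_{k=1}^{x}(\beta_n - \delta_i - \tau_k)\right)}{\sum_{j=0}^{m}\exp\left(\sum_{k=1}^{j}(\beta_n - \delta_i - \tau_k)\right)}, \qquad x = 0, 1, \ldots, m
\]

where \(\beta_n\) is the location (ability) of person \(n\), \(\delta_i\) is the location (difficulty) of item \(i\), \(\tau_k\) is the \(k\)th threshold between adjacent score points, \(m\) is the maximum score on the item, and the empty sum for \(j = 0\) is taken to be zero. The probability of each score point thus depends only on the difference between person ability and item difficulty, relative to the thresholds.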

The major advantage of Rasch measurement modelling for assessment bodies is that it allows the reliable ordering of students on an interval scale. Additionally, this statistical technique enables the interrogation and refinement of analytical marking keys, leading to improvements in the efficiency and reliability of the marking process (Heldsinger, 2012; Humphry & Heldsinger, 2009). Rasch measurement provides information on the reliability of the data, the functionality of the categories or score points, targeting in terms of item difficulty, and scale unidimensionality (Cavanagh & Romanoski, 2008; Cavanagh & Waugh, 2011). This article describes an example of this process in the marking of portfolios for the purpose of summative assessment in a senior secondary design course.

Method

A small number of Western Australian schools teaching the design course were invited to participate in the study. These schools were from all school sectors and included both metropolitan and country schools. As a result, there were six schools, six teachers, and 81 Year 12 students participating in Phase 1. In Phase 2 there were 14 schools, 14 teachers, and 113 students in the data set. Year 11 classes were involved in Phase 2 because data were collected late in the school year, making participation less intrusive to their studies. In the design course the final task requirements and assessment structure were the same for Years 11 and 12, hence the conditions for the study were the same over the two phases.


The students were required to create a portfolio of up to 15 single-sided A3 paper pages illustrating their design capability. Up to three projects were included in the portfolio to provide evidence of understanding and skills in the development and production of design. The students created their portfolios electronically and on paper (e.g., sketches), and these were then digitised into PDF documents using a computer, a scanner, or both.

The final digitised portfolios were stored on a university server and were accessed and marked online through a custom FileMaker Pro database (see Figure 1) (Williams & Campbell, 2013). The analytical assessors were experienced Western Australian Certificate of Education (WACE) design markers who had an expert understanding of the marking criteria. These markers were trained by the researchers to use the online system.

For both phases, each portfolio was marked by the same two assessors using the official assessment body (School Curriculum and Standards Authority) criteria-based analytical marking key from the design course assessment support documentation (see Appendix).

Figure 1. Analytical marking tool screen for the design course (Phase 2 shown)


These marking keys were included in the online marking tool (see Figure 1). The assessors accessed the portfolios from a location of their choice. The online marking system allowed the assessor to access each portfolio and make judgements regarding the quality of the work against the criteria. The assessors scored each criterion by clicking on the appropriate radio button, and the overall score was automatically calculated and stored.

In Phase 1, there were two markers who each assessed 82 portfolios against six criteria (see Table 1 and Appendix). The criteria categories were marked using 0–6, 0–8, or 0–10 score points.

Table 1. Phase 1 design analytical marking criteria

Label   Criteria (items)                        Score points (categories)
C1      Design elements and principles          0–6
C2      Design process                          0–10
C3      Analysis and innovation                 0–10
C4      Experimentation and selectivity         0–10
C5      Production knowledge and skills         0–10
C6      Communication and visual literacies     0–8

NB: In Rasch terminology the criteria are referred to as items.

In Phase 2, the same two markers as in Phase 1 each assessed 113 portfolios against eight criteria (see Table 2 and Appendix). The criteria categories were marked using 0–3 or 0–5 score points. This criteria-based analytical marking key was a modified version of the one used in Phase 1 (see Table 4).

Table 2. Phase 2 design analytical marking criteria

Label   Criteria (items)                                    Score points (categories)
C1      Application of planning methods                     0–5
C2      Application of skills, techniques and procedures    0–5
C3      Experimentation and evaluation                      0–5
C4      Analysis                                            0–5
C5      Originality                                         0–5
C6      Design elements and principles                      0–3
C7      Design process                                      0–3
C8      Communication and visual literacies                 0–3

NB: In Rasch terminology the criteria are referred to as items.


Method of analysis

Data were extracted from the FileMaker Pro database in the form of an Excel spreadsheet. The software RUMM2020 was used for the Rasch analysis (Andrich, Sheridan, & Luo, 2005). Before importing the data into RUMM2020 we made sure that the data structure, the criteria, and the categories/score points accurately reflected the criteria-based analytical marking key. The analysis was run using the Rasch polytomous model, which allowed the researchers to determine how well the criteria were functioning in relation to the markers' judgements of the ability of the students.

Typically, a Rasch analysis is used to indicate the "degree to which the respondents answer the items of 'different difficulty' in a logical and consistent manner" (Cavanagh & Romanoski, 2008, p. 31). However, in this case it is the assessors' judgement of students' work that is being examined against the criteria (items). The student design capability, reflected in the student score, is the latent trait being measured, and this is indicated by each of the 12 criteria–marker combinations (see Figure 2). That is, the items are not the criteria alone, but the criteria–marker combinations. The ultimate purpose of the Rasch analysis was to place the students on an interval scale, assuming that the analysis shows that the data fit the model. The markers' judgements inform the functionality of the criteria, and therefore of each score point. This provides information about student performance on the criteria based on the criteria–marker combination. To understand what it means to be a high-scoring or low-scoring student on the design capability latent trait, the criteria descriptors in the criteria-based analytical marking key should be examined (see Appendix).

Figure 2 depicts the data structure for Phase 1, showing how the latent trait results in each marker's score for each criterion. Each of the two markers allocated scores to the six criteria, resulting in 12 items. This data structure was replicated for Phase 2; however, it comprised eight criteria, resulting in 16 items.

Using all the multiple criteria–marker combinations in a single Rasch analysis is a useful approach as it allows the examination of both markers and their use of the criteria on one scale. If two analyses were conducted, one for each marker, two Rasch measurement scales would be produced and it would be difficult to evaluate the overall functionality of the criteria-based analytical marking key. The following section provides the Rasch analysis based on this structure, as shown in Figure 2, for Phase 1 and Phase 2.



Figure 2. The latent trait structure for design capability
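The study exported marks from the FileMaker Pro database into RUMM2020 rather than scripting this restructuring, but the criteria–marker layout in Figure 2 can be sketched in Python for readers who want to reproduce it with other tools. The column names and scores below are hypothetical:

    import pandas as pd

    # Hypothetical long-format export: one row per (student, marker, criterion).
    scores = pd.DataFrame({
        "student":   [1, 1, 1, 1, 2, 2, 2, 2],
        "marker":    ["M1", "M1", "M2", "M2", "M1", "M1", "M2", "M2"],
        "criterion": ["C1", "C2", "C1", "C2", "C1", "C2", "C1", "C2"],
        "score":     [4, 5, 3, 5, 2, 3, 2, 4],
    })

    # Each criterion-marker combination becomes one column (one Rasch "item"),
    # so both markers' use of every criterion sits on a single scale.
    items = scores.pivot_table(index="student",
                               columns=["criterion", "marker"],
                               values="score")
    items.columns = [f"{c}{m}" for c, m in items.columns]  # e.g. "C1M1"
    print(items)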

Rasch analysis and results—Phase 1

Summary statistics

The RUMM2020 software gives a variety of outputs. It is good practice to examine the summary test-of-fit statistics output before proceeding further. The summary test-of-fit statistics output for these data is shown in Table 3.

The item–student interaction indicates the degree to which the students were scored by the assessors on the criteria (items) in a consistent manner. When the data fit the model, the fit residuals should have means near zero and standard deviations near 1 (Cavanagh & Romanoski, 2008). In Phase 1, the mean item fit residual of 0.43 and the mean student fit residual of -0.61 indicate that overall the data fit the model and the criteria were scored consistently. The item fit residual standard deviation of 0.63 and the student fit residual standard deviation of 1.66 indicate that the variance in both the item and student data was acceptable.


The item–trait interaction indicates whether the scale is unidimensional. When the data fit the model, the item–trait interaction has a chi-square probability greater than 0.05. In Phase 1, the probability was high at 0.94. This result is evidence that the scale is measuring a unidimensional trait: the student design capability (Cavanagh & Waugh, 2011).

The proportion of variance that is considered true indicates the reliability of a set of measurements; observed variance comes from two sources, random (error) variance and true variance (Guilford & Fruchter, 1978). In a Rasch analysis the proportion of observed variance considered true is indicated by Cronbach's alpha, which ideally should be close to 1. In Phase 1 this was high at 0.94. Additionally, the power of test-of-fit statistics shows that the fit between the data and the model was excellent, with a person separation index of 0.95. In Rasch analysis the person separation index is analogous to reliability.
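RUMM2020 reports Cronbach's alpha directly, but the statistic can also be verified from the raw student-by-item score matrix. A minimal numpy sketch, with an invented 4 x 3 matrix standing in for the real data:

    import numpy as np

    def cronbach_alpha(X):
        """X: students (rows) by items (columns) matrix of scores."""
        k = X.shape[1]
        item_vars = X.var(axis=0, ddof=1).sum()  # sum of item variances
        total_var = X.sum(axis=1).var(ddof=1)    # variance of total scores
        return (k / (k - 1)) * (1 - item_vars / total_var)

    X = np.array([[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 1]])
    print(round(cronbach_alpha(X), 2))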

Table 3. Summary of Rasch psychometric statistics for Phase 1 design analysis

Item–student interaction
                                 Items (n = 12)              Students (n = 81)
                                 Location    Fit residual    Location    Fit residual
Mean                             0.00        0.43            2.46        -0.61
SD                               1.27        0.63            1.69        1.66

Item–trait interaction
Total item chi square                        14.26
Total degrees of freedom                     24.00
Total chi-square probability                 0.94

Reliability indices
Cronbach's alpha                             0.94
Person separation index                      0.95
Power of test-of-fit                         Excellent

The summary test-of-fit statistics show that the data fit the model with high reliability. RUMM2020 provides a number of outputs that enable further interrogation of the data. The following sections cover the relevant outputs: the person–item location distribution, the category probability curves and thresholds, and the item-characteristic curves. Each of these provides information on the functionality of the criteria-based analytical marking key.



Person–item location

The Rasch algorithm examines the interactions between students and items and calculates a location (logit) for each student, based on their ability, and for each item, based on its difficulty, along an interval scale. The person–item location distribution displays these locations (see Figure 3). The top half of the graph shows the student locations and the bottom half shows the item locations. By examining this graph we can see the degree of fit between students and items. Ideally, the student locations should mirror the item locations; this would show that the items were well targeted to the student ability. Easy items appear on the left-hand side of the item (bottom) scale and difficult items on the right-hand side. In Phase 1, the graph indicates that the items were too easy for these students, as shown by the person locations ranging from -1.6 to 7.5 and the item locations from -2.5 to 1.4. That is, the criteria as interpreted by the markers were too easy for the ability level of the students; the markers tended to allocate high scores.

Figure 3. Phase 1 person–item location distribution
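RUMM2020 draws the person–item location distribution itself; for readers working outside that package, a rough equivalent of Figure 3 can be plotted from estimated locations. The logit values below are invented for illustration, not the study's estimates:

    import numpy as np
    import matplotlib.pyplot as plt

    # Invented person and item locations (logits).
    persons = np.random.default_rng(0).normal(2.5, 1.7, 81)
    items = np.array([-2.5, -1.8, -1.0, -0.4, 0.2, 0.6,
                      0.9, 1.1, 1.2, 1.3, 1.35, 1.4])

    fig, (top, bottom) = plt.subplots(2, 1, sharex=True)
    top.hist(persons, bins=20)     # student locations (ability)
    top.set_ylabel("Students")
    bottom.hist(items, bins=20)    # item locations (difficulty)
    bottom.invert_yaxis()          # mirrored, Wright-map style
    bottom.set_ylabel("Items")
    bottom.set_xlabel("Location (logits)")
    plt.show()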


Category probability curves and thresholds

The Rasch output includes curves that show the probability of a given category being chosen across the range of person locations: these are called category probability curves. The category probability curves are predictive, showing how the score points would likely be used for each category. An example is shown in Figure 4 for Criteria 5 Marker 1 (C5M1).

The numbers at the top of the curves are the 10 categories or score points, numbered from 0 to 9, for Criterion 5. That is, a student with a person-location score (logit) of 1 is most likely (probability of 0.46) to be given a score point of 5. However, there is some chance they would be given a score of 4 (probability of 0.2) or 6 (probability of 0.27), and a very small chance of a score of 3 (probability of 0.02) or 7 (probability of 0.04). The most useful aspect of this category probability curve is that it shows the thresholds between categories or score points. The thresholds are the points of intersection between consecutive score points or categories. For example, the threshold between score points 1 and 2 is at -3.4, where the probability of each score point is 0.35. When examining this graph we are mostly interested in whether the thresholds are ordered correctly. Ordering means, for example, that the intersection between score points 1 and 2 occurs before the intersection between 2 and 3, and so on.

Figure 4. Criteria 5 Marker 1 category probability curve


As a counter-example, Figure 5 shows the category probability curves for Criteria 5 Marker 2 (C5M2). It can be seen in this graph that the intersection points of score points 0, 1, 2, and 3 are disordered. In comparison with Criteria 5 Marker 1 (C5M1), the score points are not functioning well and are not discriminating effectively between students.

For Phase 1, the category probability curves showed that the criteria with 10 score points (Criteria 2, 3, 4, and 5) had too many score points, which did not provide enough discrimination. Overall, Marker 1 showed a better use of the score points than Marker 2 (see Figures 4 and 5). This may be due to one or more factors, including the wording of the criteria, markers' understanding of the criteria, the ease of marking with the criteria, and task requirements that were too easy. Figure 5 shows the category probability curve for Criteria 5 Marker 2 (C5M2).

Of the six criteria, there were no disordered thresholds for Marker 1, but the analysis shows that there were disordered thresholds for Marker 2. In addition to Criterion 5, Criteria 3 and 4 were also disordered for score points 0, 1, and 2. These criteria have 10 score points. For most of the criteria, including those that are ordered, the lower score points were not utilised. This indicates that the criteria could be rewritten or redefined to reduce the number of categories and so increase the discrimination between score points.

Figure 5. Criteria 5 Marker 2 category probability curve
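The curves in Figures 4 and 5 are generated from the rating scale model stated earlier. The sketch below computes the category probabilities at a few person locations for a hypothetical polytomous item; the thresholds are invented, not estimates from this study:

    import numpy as np

    def category_probs(beta, delta, taus):
        """Rating-scale category probabilities at person location beta."""
        logits = beta - delta - np.asarray(taus)          # taus: tau_1..tau_m
        cum = np.concatenate(([0.0], np.cumsum(logits)))  # empty sum = 0
        expd = np.exp(cum)
        return expd / expd.sum()                          # scores 0..m

    taus = [-3.4, -1.8, -0.4, 1.0, 2.6]   # invented thresholds for a 0-5 item
    for beta in (-2.0, 0.0, 2.0):
        print(beta, np.round(category_probs(beta, 0.0, taus), 2))

Disordered thresholds show up directly in such output: if an estimated threshold exceeds the one after it, some score point is never the most probable category at any person location.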


Item-characteristic curves

Item-characteristic curves plot the student (person) location score (logit) against the expected score predicted by the Rasch model. For example, for C5M1 a student with a location of 1 would have an expected score of 5.11 (see Figure 6). The slope of the curve shows how well the item is discriminating across the range of student scores. In Figure 6, the even slope shows that the item is discriminating well across the range (-4 to 7). The students are classified into class intervals—in this case groups of 27, represented as black dots—to show how well the observed student scores match those predicted by the model (see Figure 6). The slope for C5M2 varies over the range of observed scores (see Figure 7).

Figure 6. Criteria 5 Marker 1 item characteristic curve

Figure 7. Criteria 5 Marker 2 item characteristic curve


From -2 to -0.5 the curve is very steep, indicating that a small change in person location results in a large change in the expected score. The steepness in the C5M2 curve (Figure 7) reflects the inconsistent use of score points 0, 1, 2, and 3, as shown by the disordered thresholds in Figure 5. In contrast, Marker 1 (C5M1) allocated the score points in a more ordered and consistent way (see Figure 4).

Not all of the item-characteristic curves are smooth, and there seems to be some lack of good discrimination between students of higher and lower ability.
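The expected score plotted on the vertical axis of an item-characteristic curve is the model-implied mean over the score points. In the notation of the rating scale model given earlier:

\[
E[X_{ni}] = \sum_{x=0}^{m} x \, P(X_{ni} = x)
\]

For instance, C5M1's expected score of 5.11 at a person location of 1 is this probability-weighted average of the score points 0–9 at that location.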

Implications and modifications as a result of the analysis

A Phase 1 report was written for the School Curriculum and Standards Authority, outlining the results of the Rasch analysis. The report identified the need to modify the analytical marking key by reducing the number of score points, suggesting no more than five. The report also stated that the criteria had complex wording, making them difficult to interpret, and that this may be one reason for the markers allocating different scores. Although this was suggested, the research team did not have any involvement in the modification of the Phase 2 analytical marking key. The changes to the analytical marking key from Phase 1 to Phase 2 are shown in Table 4. The full analytical marking keys for Phases 1 and 2 are provided in the Appendix.

Table 4. Outline of changes from Phase 1 to the Phase 2 analytical marking key

                              Phase 1                                       Phase 2
Criteria                      6 criteria                                    8 criteria
Score points                  0–6, 0–8, 0–10                                0–3, 0–5
Score point description       Complex wording                               Concise wording
Allocation of score points    2 score points per score point description    1 score point per score point description


Rasch analysis and results—Phase 2

Summary test-of-fit statistics

The Rasch summary statistics for Phase 2 are shown in Table 5. The mean fit residuals and the fit residual standard deviations were all within acceptable limits. The chi-square probability was high, indicating that the scale—the student design capability—was unidimensional. Both Cronbach's alpha and the person separation index were high, indicating that the analysis had a high level of reliability.

Table 5. Summary of Rasch psychometric statistics for Phase 2 design

Item–student interaction
                                 Items (n = 16)              Students (n = 113)
                                 Location    Fit residual    Location    Fit residual
Mean                             0.00        -0.01           1.60        -0.48
SD                               0.81        1.04            1.86        1.55

Item–trait interaction
Total item chi square                        20.68
Total degrees of freedom                     32
Total chi-square probability                 0.94

Reliability indices
Cronbach's alpha                             0.94
Person separation index                      0.94
Power of test-of-fit                         Excellent

Person–item location

For Phase 2, the person–item location distribution shows that the item difficulty was somewhat better targeted to the students than in Phase 1 (see Figure 8). However, the distribution of item locations was still low, from -2.5 to 1.4, relative to person locations from -1.5 to 7.5, indicating that the modified criteria were still very easy for the ability of the students.


Category probability curves and thresholds

Unlike Phase 1, there were no disordered thresholds for Marker 1 or 2 in Phase 2. The structure of the criteria and the score points are generally functioning well and the full range of score points was used. However, for Criteria 1 to 5, which had six score points (0–5), the lowest score point was not utilised.

Item-characteristic curves

The item-characteristic curves show that for some criteria there is not good discrimination over some person-location ranges. For example, the curve for C6M1 shows a relative lack of discrimination between 0 and 3 (see Figure 9).

Discussion

This article has illustrated the use of an advanced statistical technique, Rasch measurement modelling, to examine the functionality of a criteria-based analytical marking key. This technique is desirable for assessment bodies as it results in students being placed on an interval scale, ordering them according to their ability on the latent trait being measured. Rasch measurement also provides outputs which inform assessment bodies about the functionality and reliability of analytical marking keys.

Figure 8. Person–item location distribution (Phase 2 design analytical marking)


This technique could also be used by schools and teachers in situations where a large cohort undertakes an assessment task that is marked using an analytical marking key.

The summary test-of-fit statistics show whether the data fit the Rasch model and whether they represent the unidimensional trait being measured. In Phases 1 and 2 of this study, the marker responses to the analytical marking key fitted the Rasch model well. However, this is an overall fit, and it is likely that there are still ways in which the analytical marking key can be improved. These may include improvements in targeting, in the number and utilisation of score points, and in item discrimination across the range of scores.

Problems with the targeting of items can be identified by examining the person–item distribution, which shows whether the items are well targeted to the persons undertaking the assessment. It is clear that the analytical marking key was improved in Phase 2, as indicated by some improvement in the targeting of items shown in the person–item distribution. However, the items were still too easy, and further enhancement could be made in this area.

Figure 9. Item-characteristic curve for C6M1 (Phase 2 design analytical marking)


The functionality of the score points for each item is represented in the category probability curves. In Phase 1 there were disordered thresholds, and this could be due to a number of factors. These include poor marker understanding of the criteria, over-complex and/or poorly worded criteria, too many score points, or a range of score points for a single descriptor, making score-point utilisation difficult.

Having identified these factors through Rasch measurement, it is important to qualitatively examine the analytical marking key to attempt to identify the source of the problems. It is important to review the construct being measured to ensure that each criterion descriptor accurately reflects the construct. In Phase 2, the increased number of criteria indicates that a review of the construct was undertaken. The wording of the criteria for Phase 2 was refined to reduce complexity and improve marker consistency. It is also important to ensure that markers are well trained and share a common understanding of the criteria.

Reliable marking is hindered by criteria where one descriptor allows for a range of marks. In Phase 2 the analytical marking key was improved by having a single score point for each criterion descriptor, which is preferable as it removes ambiguity. Additionally, the number of score points was reduced for each criterion. This reduction was reflected in the Phase 2 results, where there were no disordered thresholds and score points were utilised across the entire range for each criterion.

Finally, the item-characteristic curves show how well each item discriminates across the range of person-location scores. Problems highlighted by these curves (items over- or under-discriminating for certain ranges of scores) can usually be resolved by addressing the same issues as those identified from the category probability curves.

Conclusion

This article has shown how an analytical marking key can be interrogated and improved by using Rasch measurement. The creation and design of criteria-based analytical marking keys is not a transparent process, and what appears to be a useful assessment tool may not be functioning as desired. The Rasch model enables the score points within each criterion of the marking key to be examined to determine whether they are functioning correctly.


As a result, the analytical marking key can be modified, readministered, and reanalysed, leading to a cycle of improvement. Furthermore, the scores resulting from the Rasch analysis provide a highly reliable ordering of student scores along an interval scale, which is a critical goal for assessment bodies undertaking high-stakes assessment.

Acknowledgements

The study was the work of a research team at the Centre for Schooling and Learning Technologies (CSaLT) at Edith Cowan University in partnership with the School Curriculum and Standards Authority (SCSA) in Western Australia. The study was supported by an Australian Research Council (ARC) Linkage research grant. The research team comprised Associate Professor Paul Newhouse, Dr Jeremy Pagram, Dr Lisa Paris, Professor Mark Hackling, Dr Pina Tarricone, Dr Martin Cooper and Dr Alistair Campbell.

References

Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573.

Andrich, D. (1988). Rasch models for measurement. Beverly Hills, CA: Sage Publications.

Andrich, D. (2005). A report to the Curriculum Council of Western Australia regarding assessment for tertiary selection (pp. 1–53). Perth, Australia: Curriculum Council of Western Australia.

Andrich, D., Sheridan, B., & Luo, G. (2005). RUMM2020: A Windows-based item analysis program employing Rasch unidimensional measurement models. Perth, WA: RUMM Laboratory.

Cavanagh, R. F., & Romanoski, J. T. (2008). Linking student behaviours and attitudes towards information and communication technology with learning processes, teacher instruction and classroom environment. In E. P. Velliotis (Ed.), Classroom culture and dynamics (pp. 19–69). New York: Nova Science Publishers.

Cavanagh, R. F., & Waugh, R. F. (2011). The utility of Rasch measurement for learning environments research. In R. F. Cavanagh & R. F. Waugh (Eds.), Applications of Rasch measurement in learning environments research (pp. 3–15). Rotterdam: Sense Publishers.

Chi, E. (2001). Comparing holistic and analytic scoring for performance assessment with many-facet Rasch model. Journal of Applied Measurement, 2(4), 379–388.

Guilford, J. P., & Fruchter, B. (1978). Fundamental statistics in psychology and education (6th ed.). New York: McGraw-Hill.

Heldsinger, S. (2012). Using a measurement paradigm to guide classroom assessment processes. In C. F. Webber & J. L. Lupart (Eds.), Leading student assessment: Studies in educational leadership (pp. 241–262). Dordrecht: Springer.

Humphry, S., & Heldsinger, S. (2009, August). Do rubrics help to inform and direct teaching practice? Paper presented at the ACER Research Conference: Assessment and Student Learning: Collecting, interpreting and using data to inform teaching, Perth, Western Australia.

Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2, 130–144.

Kimbell, R., & Stables, K. (2007). Researching design learning: Issues and findings from two decades of research and development. Dordrecht, The Netherlands: Springer.

Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1(1), 1–27. doi:10.1016/0022-2496(64)90015-x

Newhouse, C. P. (2011). Comparative pairs marking supports authentic assessment. In R. F. Cavanagh & R. F. Waugh (Eds.), Applications of Rasch measurement in learning environments research (pp. 141–180). Rotterdam, The Netherlands: Sense Publishers.

Newhouse, C. P. (2013). Literature review and conceptual framework. In P. J. Williams & C. P. Newhouse (Eds.), Digital representations of student performance for assessment (pp. 9–28). Rotterdam, The Netherlands: Sense Publishers.

Newhouse, C. P., & Tarricone, P. (in press). Digitising practical production work for high-stakes assessments. Canadian Journal of Learning and Technology.

Panadero, E., & Jonsson, A. (2013). The use of scoring rubrics for formative assessment purposes revisited: A review. Educational Research Review, 9, 129–144.

Pollitt, A. (2004, June). Let's stop marking exams. Paper presented at the International Association for Educational Assessment conference, Philadelphia.

Popham, W. J. (1997). What's wrong and what's right with rubrics. Educational Leadership, 72–75.

Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Nielsen & Lydiche.

Tierney, R., & Simon, M. (2004). What's still wrong with rubrics: Focussing on the consistency of performance criteria across scale levels. Practical Assessment, Research and Evaluation, 9(2).

Tognolini, J. (2006). Meeting the challenge of assessing in a standards based education system. Perth, Australia: Curriculum Council of Western Australia.

Williams, P. J., & Campbell, A. (2013). Method and analysis. In P. J. Williams & C. P. Newhouse (Eds.), Digital representations of student performance for assessment (pp. 29–48). Rotterdam, The Netherlands: Sense Publishers.

Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.

Appendix

Phase 1 criteria-based analytical marking key provided for the research by the School Curriculum and Standards Authority (SCSA)

2011 Design Course Marking Key: Practical (portfolio) 50% (50 marks)

Marks   Description

Criterion 1: Design elements and principles
(Application of design principles, use of design elements)
5–6   Uses elements and principles of design in an original way, to communicate highly effectively and creatively. Demonstrates a sophisticated level of discernment in selecting and applying relevant design and composition principles.
3–4   Uses elements and principles of design in an original way, to communicate effectively. Displays adherence to design and composition principles and uses them effectively, demonstrating selectivity and discernment.
1–2   Uses the elements and principles of design inconsistently. Shows minimal understanding of design and composition principles.
0     Uses the elements and principles of design inappropriately. Shows little understanding of design and composition.

Criterion 2: Design process
(Brainstorming, idea generation methods, visualisation of concepts)
5–6   Demonstrates consistent and sophisticated use of appropriate design processes.
3–4   Demonstrates competent use of appropriate design processes.
1–2   Applies a design process inconsistently and/or with errors.
0     Little attempt to apply a design process.

Criterion 3: Analysis and innovation
(Analysis of information and translation into design concepts, application of interrelated thinking and innovative development process)
9–10  Sophisticated analysis of information. Complex ideas are used to produce highly innovative original solutions.
7–8   Well analysed, considering all relevant information. Designs produced are original and highly innovative.
5–6   Well analysed, considering some information. Designs produced are original and innovative.
3–4   Clearly analysed information. Produces effective designs without significant innovation.
1–2   Simple analysis, mainly regurgitation of information. Simple development of design concepts.
0     Minimal analysis and engagement with information. Some development of design concept, with unexplained gaps in development.

Criterion 4: Experimentation and selectivity
(Use of interpretive skills and problem solving)
9–10  Extensive and sophisticated experimentation with highly appropriate and diverse interpretive and problem-solving skills, techniques and procedures.
7–8   Extensive experimentation with appropriate and diverse interpretive and problem-solving skills, techniques and procedures.
5–6   Experimentation with appropriate interpretive and problem-solving skills, techniques and procedures.
3–4   Some experimentation with mostly appropriate interpretive and problem-solving skills, techniques and procedures.
1–2   Limited experimentation is evident; interpretive and problem-solving skills, techniques and procedures selected are unsuited to the task.
0     Minimal experimentation or problem solving is evident.

Criterion 5: Production knowledge and skills
(Selection and use of a diverse range of skills, techniques and procedures; application of planning and production methods)
9–10  Production methods and techniques are executed to an industry standard. Extensive, detailed and successful use of codes and conventions (relevant industry standards are fully met).
7–8   Production methods and techniques are executed to near industry standard. Employs a broad range of codes and conventions suited to the task (relevant industry standards are met).
5–6   Production methods and techniques are executed in a competent manner. Employs a moderate range of codes and conventions generally suited to the task (relevant industry standards are mainly met).
3–4   Production methods and techniques are executed in a basic manner. Employs a limited range of codes and conventions, some of which are unsuitable to the task (relevant industry standards are sometimes met).
1–2   Some production methods are used but significant mistakes are made. Employs a very limited range of codes and conventions with little development and/or errors (relevant industry standards are not met).
0     Limited use of production methods. Demonstrates very limited understanding of codes and conventions. Relevant industry standards missing.

Criterion 6: Communication and visual literacies
(Ability to interpret design brief, ability to construct a visual image that conveys a message)
7–8   Highly communicative work, conveying messages very clearly.
5–6   Interprets the brief appropriately and communicates effectively to the intended audience.
3–4   Satisfies the brief adequately and communicates to the intended audience.
1–2   The brief is inadequately addressed; the needs of the audience are only touched on in the work.
0     Little evidence that the brief is addressed and the needs of the audience are lacking in the work.

Phase 2 criteria-based analytical marking key provided for the research by the School Curriculum and Standards Authority (SCSA)

2012 Design Course Marking Key: Practical (portfolio) 50% (34 marks)

Marks   Description

Criterion 1: Application of planning methods
(Evidence of use of planning methods)
5   Detailed planning relevant to context is completed to an excellent standard.
4   Planning relevant to context is completed to a high standard.
3   Planning relevant to context is completed in a competent manner.
2   Some planning relevant to context is evident.
1   Limited use of planning is evident.
0   No use of planning methods.
Total: 5

Criterion 2: Application of skills, techniques and procedures
(Evidence of competence in skills, techniques and procedures)
5   Consistently employs a broad range of skills, techniques and procedures suited to the task.
4   Competently employs a range of skills, techniques and procedures generally suited to the task.
3   Employs a moderate range of skills, techniques and procedures, most of which are suitable to the task.
2   Employs a limited range of skills, techniques and procedures sometimes unsuited to the task.
1   Demonstrates very limited use of skills, techniques and procedures.
0   Does not demonstrate use of skills, techniques and procedures.
Total: 5

Criterion 3: Experimentation and evaluation
(Evidence of experimentation, choice of design solutions and evaluation)
5   Extensive experimentation, original choice of design solutions and discerning evaluation.
4   Accomplished experimentation, inventive choice of design solutions and thoughtful evaluation.
3   Some experimentation, appropriate choice of design solutions and acceptable evaluation.
2   Basic experimentation and choice of design solutions with unclear evaluation.
1   Minimal experimentation and choice of design solution with vague evaluation.
0   No experimentation, choice of design solution or evaluation.
Total: 5

Criterion 4: Analysis
(Evidence of use of analysis methods, clear thinking and sound reasoning)
5   Well analysed, synthesizing all relevant information.
4   Well analysed, considering most relevant information.
3   Clearly analysed using some information.
2   Simple analysis, mainly relying on supplied information.
1   Minimal analysis or engagement with information.
0   No analysis or engagement with information.
Total: 5

Criterion 5: Originality
(Evidence of innovation and originality; application of interrelated thinking and innovative development process)
5   Designs produced are original, displaying innovative design solutions.
4   Designs produced show some originality and innovation.
3   Produces effective designs.
2   Produces simple designs, relying heavily on pre-existing designs.
1   Designs produced are complete but are not original.
0   Designs are incomplete showing no originality.
Total: 5

Criterion 6: Design elements and principles
(Evidence of effective use of design elements and principles)
3   Effectively applies design elements and principles and uses them to demonstrate selectivity and discernment.
2   Uses design elements and principles appropriately.
1   Shows little use of design elements and principles.
0   No use of design elements and principles.
Total: 3

Criterion 7: Design process
(Evidence of competence in use of design process; brainstorming, idea generation methods and/or visualisation of concepts)
3   Demonstrates skilled use of appropriate design processes.
2   Competently applies a design process.
1   Little evidence of the application of a design process.
0   No attempt to apply a design process.
Total: 3

Criterion 8: Communication and visual literacies
(Ability to respond to a brief and construct a visual image that conveys a message)
3   Responds to a brief and expressively communicates a visual message to the intended audience.
2   Responds to aspects of a brief and adequately communicates a visual message to the intended audience.
1   Limited response to a brief with little evidence of the communication of a visual message.
0   No evidence of response to a brief or a visual message.
Total: 3


The authors

Dr Pina Tarricone is a postdoctoral research fellow with the Centre for Schooling and Learning Technologies at Edith Cowan University in Western Australia. Pina's PhD in educational/cognitive psychology won the 2007 Edith Cowan University Research Medal. Psychology Press published her thesis research as a book titled The Taxonomy of Metacognition. She has an MEd in Interactive Multimedia and has just completed an MEd in Educational Measurement. Her interests include educational psychology constructs, psychometrics, and the use of technologies for assessment.

E-mail: [email protected] (corresponding author)

Dr Martin G. Cooper is a lecturer and researcher with the Centre for Schooling and Learning Technologies at Edith Cowan University in Western Australia. His current projects and research interests include investigating the feasibility of using digital representations of work for authentic and reliable performance assessment in senior secondary school courses, the implementation of technology education in remote Western Australia, and the use of "bring your own digital device" (BYOD) strategies in secondary and tertiary education.

E-mail: [email protected]
