
TECHNICAL REPORT

ALGEBRA I, BIOLOGY, AND LITERATURE

2011

Provided by Data Recognition Corporation
February 2012


    TABLE OF CONTENTS

    Glossary of Common Terms ............................................................................................................................. i

    Preface: An Overview of Assessments .......................................................................................................... vii

    The Keystone Exams From 2008 to Present ...................................................................................................... vii

    Assessment Activities Occurring in the 2010-2011 School Year ....................................................................... viii

    Chapter One: Background of the Keystone Exams ........................................................................................... 1

    Assessment History in Pennsylvania .................................................................................................................... 1

    The Keystone Exams ............................................................................................................................................ 1

    Chapter Two: Test Development Overview of the Keystone Exams .................................................................. 5

    Keystone Blueprint/Assessment Anchors and Eligible Content .......................................................................... 5

    High-Level Test Design Considerations ................................................................................................................ 7

    Online Testing Design Considerations ................................................................................................................. 8

    Algebra I ............................................................................................................................................................... 9

    Biology ................................................................................................................................................................ 11

    Literature ........................................................................................................................................................... 12

    Literature Passages ............................................................................................................................................ 14

    Chapter Three: Item and Test Development Processes .................................................................................. 17

    General Keystone Test Development Processes ............................................................................................... 17

    General Test Definition ...................................................................................................................................... 18

    Algebra I Test Definitions ................................................................................................................................... 19

    Biology Test Definitions ..................................................................................................................................... 21

    Literature Test Definitions ................................................................................................................................. 23

    Item Development Considerations .................................................................................................................... 25

    Item and Test Development Cycle ..................................................................................................................... 27

    General Item and Test Development Process ................................................................................................... 30

    Chapter Four: Universal Design Procedures Applied to the Keystone Exams Test Development Process ......... 37

    Universal Design................................................................................................................................................. 37

    Elements of Universally Designed Assessments ................................................................................................ 37

    Guidelines for Universally Designed Items ........................................................................................................ 39

    Item Development ............................................................................................................................................. 40


    Item Format ....................................................................................................................................................... 41

    Assessment Accommodations ........................................................................................................................... 42

    Chapter Five: Field Test Leading to the Spring 2011 Core ............................................................................... 43

    Field Test Overview ............................................................................................................................................ 43

    Fall 2010 Keystone Exams Standalone Field Test .............................................................................................. 43

    Statistical Analysis of Item Data ......................................................................................................................... 47

    Review of Items with Data ................................................................................................................................. 48

    Differential Item Functioning ............................................................................................................................. 49

    Chapter Six: Operational Forms Construction for Spring 2011 ........................................................................ 53

    Final Selection of Items and Keystone Forms Construction .............................................................................. 53

    Special Forms Used with the Operational 2011 Keystone Exams ..................................................................... 54

    Chapter Seven: Test Administration Procedures............................................................................................ 57

    Sections, Sessions, Timing, and Layout of the Keystone Exams ........................................................................ 57

    Sections and Sessions ........................................................................................................................................ 57

    Timing................................................................................................................................................................. 58

    Layout................................................................................................................................................................. 60

    Shipping, Packaging, and Delivery of Materials ................................................................................................. 61

    Chapter Eight: Processing and Scoring .......................................................................................................... 63

    Receipt of Materials ........................................................................................................................................... 63

    Scanning of Materials ........................................................................................................................................ 64

    Materials Storage ............................................................................................................................................... 67

    Scoring Multiple-Choice Items ........................................................................................................................... 67

    Rangefinding ...................................................................................................................................................... 67

    Reader Recruitment and Qualifications ............................................................................................................. 68

    Leadership Recruitment and Qualifications....................................................................................................... 68

    Training .............................................................................................................................................................. 69

    Handscoring Process .......................................................................................................................................... 70

    Handscoring Validity Process ............................................................................................................................. 70

    Quality Control ................................................................................................................................................... 72


    Chapter Nine: Description of Data Sources and Sampling Adequacy .............................................................. 75

    Student Filtering Criteria .................................................................................................................................... 75

    Key Verification Data ......................................................................................................................................... 76

    Calibration of Operational Test Data ................................................................................................................. 76

    Final Data ........................................................................................................................................................... 76

    Spiraling of Forms .............................................................................................................................................. 76

    Chapter Ten: Summary Demographic and Accommodation Data for Spring 2011 Keystone Exams ................. 79

    Assessed Students .............................................................................................................................................. 79

    Demographic Characteristics of Students Receiving Test Scores ...................................................................... 82

    Test Accommodations Provided ........................................................................................................................ 86

    Glossary of Accommodation Terms ................................................................................................................. 100

    Chapter Eleven: Classical Item Statistics ...................................................................................................... 105

    Item-Level Statistics ......................................................................................................................................... 105

    Item Difficulty .................................................................................................................................................. 105

    Item Discrimination .......................................................................................................................................... 106

    Discrimination of Difficulty Scatterplot ........................................................................................................... 107

    Observations and Interpretations .................................................................................................................... 109

    Chapter Twelve: Rasch Item Calibration ...................................................................................................... 111

    Description of the Rasch Model ....................................................................................................................... 111

    Checking Rasch Assumptions ........................................................................................................................... 112

    Item Parameter Invariance .............................................................................................................................. 116

    Rasch Item Statistics ........................................................................................................................................ 119

    Chapter Thirteen: Standard Setting ............................................................................................................ 125

    Standard Setting and Performance Level Descriptors ..................................................................................... 125

    Development Overview for the Performance Level Descriptors ..................................................................... 125

    PLD Meeting 1 .................................................................................................................................................. 126

    PLD Meeting 2 .................................................................................................................................................. 129

    Standard Setting............................................................................................................................................... 132


    Chapter Fourteen: Scaling .......................................................................................................................... 147

    Raw Scores to Rasch Ability Estimates ............................................................................................................. 147

    Rasch Ability Estimates to Scaled Scores ......................................................................................................... 148

    Raw-to-Scaled Score Tables ............................................................................................................................. 150

    Chapter Fifteen: Equating ........................................................................................................................... 151

    Pre- vs. Post-Equating ...................................................................................................................................... 151

    Equating Design for Keystone Exams ............................................................................................................... 152

    Evaluation of Item Parameter Stability ............................................................................................................ 152

    Equating for the Future Embedded Field-Test Items ....................................................................................... 153

    Chapter Sixteen: Scores and Score Reports ................................................................................................. 155

    Scoring.............................................................................................................................................................. 155

    Description of Total Test Scores ...................................................................................................................... 155

    Description of Module Scores .......................................................................................................................... 157

    Appropriate Score Use ..................................................................................................................................... 158

    Cautions for Score Use ..................................................................................................................................... 159

    Report Development ................................................................................... 160

    Reports ............................................................................................................................................................. 160

    Chapter Seventeen: Operational Test Statistics ........................................................................................... 165

    Performance Level Statistics ............................................................................................................................ 165

    Scaled Scores .................................................................................................................................................... 165

    Raw Scores ....................................................................................................................................................... 166

    Chapter Eighteen: Reliability ...................................................................................................................... 169

    Reliability Indices ............................................................................................................................................. 170

    Coefficient Alpha .............................................................................................................................................. 170

    Further Interpretations .................................................................................................................................... 172

    Standard Error of Measurement ...................................................................................................................... 174

    Results and Observations ................................................................................................................................. 176

    Rasch Conditional Standard Errors of Measurement ...................................................................................... 177

    Results and Observations ................................................................................................................................. 178

    Decision Consistency ........................................................................................................................................ 180

    Results and Observations ................................................................................................................................. 182

    Rater Agreement .............................................................................................................................................. 182


    Results and Observations ................................................................................................................................. 182

    Chapter Nineteen: Validity ......................................................................................................................... 185

    Purposes and Intended Uses of the Keystone Exams ...................................................................................... 185

    Evidence Based on Test Content ...................................................................................................................... 185

    Evidence Based on Response Process.............................................................................................................. 187

    Evidence Based on Internal Structure .............................................................................................................. 188

    Evidence Based on Relationships with Other Variables .................................................................................. 192

    Evidence Based on Consequences of Tests ..................................................................................................... 193

    Evidence Related to the Use of the Rasch Model ............................................................................................ 194

    Validity Evidence Summary .............................................................................................................................. 194

    Chapter Twenty: Special Study on Test Administration Mode ...................................................................... 197

Summary of Students' Demographic Distributions ......................................................... 197

    Survey Results Summary .................................................................................................................................. 197

    Mode DIF Summary of Operational Items ....................................................................................................... 200

    Raw-to-Scaled Score Comparison .................................................................................................................... 200

    Mode DIF Summary of Field-Test Items........................................................................................................... 206

    References ................................................................................................................................................. 207

    Appendix A: Understanding Depth of Knowledge and Cognitive Complexity ................................................ 211

    Algebra I, Algebra II, and Geometry Depth of Knowledge ............................................................................... 212

    Biology Depth of Knowledge ............................................................................................................................ 215

    Literature Depth of Knowledge ....................................................................................................................... 219

    English Composition Depth of Knowledge ....................................................................................................... 222

    References ....................................................................................................................................................... 225

    Appendix B: General Scoring Guidelines ..................................................................................................... 227

    Algebra I ........................................................................................................................................................... 227

    Biology .............................................................................................................................................................. 228

    Literature ......................................................................................................................................................... 229

    Appendix C: Keystone Exams Spring 2011 Tally Sheets ................................................................................ 231

    Algebra I ........................................................................................................................................................... 231

    Biology .............................................................................................................................................................. 233

    Literature ......................................................................................................................................................... 235


    Appendix D: Item and Test Development Process for the Keystone Exams ................................................... 237

    Appendix E: Item and Data Review Card Examples ..................................................................................... 243

    Item Review Card Example .............................................................................................................................. 243

    Data Review Card Example .............................................................................................................................. 244

    Appendix F: Item Rating Sheet and Criteria Guidelines ................................................................................ 247

    Item Rating Sheet ............................................................................................................................................. 247

    Item Review Criteria Guidelines ...................................................................................................................... 248

    Appendix G: Keystone Exams Spring 2011 Module Layout Plans .................................................................. 251

    Appendix H: Mean Raw Scores by Form ...................................................................................................... 253

    Algebra I ........................................................................................................................................................... 253

    Biology .............................................................................................................................................................. 254

    Literature ......................................................................................................................................................... 255

    Appendix I: Item Statistics .......................................................................................................................... 257

    Algebra I Multiple-Choice Item Statistics ......................................................................................................... 258

    Biology Multiple-Choice Item Statistics ........................................................................................................... 267

    Literature Multiple-Choice Item Statistics ....................................................................................................... 282

    Algebra I Constructed-Response Item Statistics .............................................................................................. 292

    Biology Constructed-Response Item Statistics ................................................................................................ 294

    Literature Constructed-Response Item Statistics ............................................................................................ 296

    Appendix J: Raw-to-Scaled Score Conversion Tables ................................................................................... 299

    Algebra I ........................................................................................................................................................... 299

    Biology .............................................................................................................................................................. 301

    Literature ......................................................................................................................................................... 303

    Appendix K: Reliabilities ............................................................................................................................. 305

    Algebra I ........................................................................................................................................................... 305

    Biology .............................................................................................................................................................. 307

    Literature ......................................................................................................................................................... 309


    GLOSSARY OF COMMON TERMS

    The following table contains some terms used in this technical report and their meanings. Some of these terms are used universally in the assessment community, and some of these terms are used commonly by psychometric professionals.

    Term Common Definition

    Ability

In Rasch measurement, ability is a generic term indicating the level of an individual on the construct measured by an exam. For the Keystone Exams, as an example, a student's literature ability is measured by how the student performed on the Keystone Literature exam. A student who answered more items correctly has a higher ability estimate than a student who answered fewer items correctly.

    Adjacent Agreement

Adjacent agreement is a difference of one (1) point between the scores/ratings assigned by two different raters under the same conditions (e.g., two independent raters give the same paper scores that differ by one point).

    Alternate Forms

Alternate forms are two or more versions of a test that are considered exchangeable; for example, they measure the same constructs in the same ways, are intended for the same purposes, and are administered using the same directions. More specific terminology applies depending on the degree of statistical similarity between the test forms (e.g., parallel forms, equivalent forms, and comparable forms), where parallel forms refers to the situation in which the test forms have the highest degree of similarity to each other.

    Average

    Average is a measure of central tendency in a score distribution that usually refers to the arithmetic mean of a set of scores. In this case, it is determined by adding all the scores in a distribution and then dividing the obtained value by the total number of scores. Sometimes people use the word average to refer to other measures of central tendency such as the median (the score in the middle of a distribution) or mode (the score value with the greatest frequency).

    Bias

    In a statistical context, bias refers to any source of systematic error in the measurement of a test score. In discussing test fairness, bias may refer to construct-irrelevant components of test scores that differentially affect the performance of different groups of test takers (e.g., gender, ethnicity). Attempts are made to reduce bias by conducting item fairness reviews and various differential item functioning (DIF) analyses, detecting potential areas of concern, and either removing or revising the flagged test items prior to the development of the final operational form of the test (see also differential item functioning).

    Constructed-Response Item

A constructed-response (CR) item, referred to by some as an open-ended (OE) response item, is an item format that requires examinees to create their own responses, which can be expressed in various forms (e.g., written paragraph, created table/graph, formulated calculation). Such items are frequently scored using more than two score categories, that is, polytomously (e.g., 0, 1, 2, 3). This format contrasts with formats in which students make a choice from a supplied set of answer options, for example, multiple-choice (MC) items, which are typically dichotomously scored as right = 1 or wrong = 0. When interpreting item difficulty and discrimination indices, it is important to consider whether an item is polytomously or dichotomously scored.

    Content Validity Evidence

    Content validity evidence shows the extent to which an exam provides an appropriate sampling of a content domain of interest (e.g., assessable portions of Algebra I curriculum in terms of the knowledge, skills, objectives, and processes sampled).


    Criterion-Referenced Interpretation

The criterion-referenced score is interpreted as a measure of a student's performance against an expected level of mastery, educational objective, or standard. The resulting score interpretations provide information about what a student knows or can do in a given content area.

    Cut Score

    A cut score marks a specified point on a score scale where scores at or above that point are interpreted or acted upon differently from scores below that point (e.g., a score designated as the minimum level of performance needed to pass a competency test). A test can be divided into multiple proficiency levels by setting one or more cut scores. Methods for establishing cut scores vary. For the Keystone Exams, three cut scores are used to place students into one of four performance levels (see also standard setting).
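As an illustration of how cut scores partition a score scale, the following sketch uses three hypothetical cut scores and four generic performance levels; the operational Keystone cut scores are established through standard setting (see Chapter Thirteen).

```python
import bisect

# Hypothetical cut scores on a scaled-score metric (illustrative only;
# the operational Keystone cut scores come from standard setting).
CUT_SCORES = [1439, 1500, 1546]
LEVELS = ["Below Basic", "Basic", "Proficient", "Advanced"]

def performance_level(scaled_score: int) -> str:
    """Return the performance level whose range contains the score.

    A score at or above a cut score falls in the higher level.
    """
    return LEVELS[bisect.bisect_right(CUT_SCORES, scaled_score)]

print(performance_level(1499))  # Basic
print(performance_level(1500))  # Proficient
```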

    Decision Consistency

Decision consistency is the extent to which classifications based on test scores would match the decisions on students' proficiency levels based on scores from a second parallel form of the same test. It is often expressed as the proportion of examinees who are classified the same way from the two test administrations.
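A minimal sketch of the idea: given the classifications the same examinees would receive from two parallel forms, decision consistency is the proportion classified identically. (In practice such indices are typically estimated from a single administration with a statistical model rather than an actual second form; see Chapter Eighteen.)

```python
def decision_consistency(levels_form_a, levels_form_b):
    """Proportion of examinees placed in the same performance level
    by two parallel forms (illustrative calculation only)."""
    matches = sum(a == b for a, b in zip(levels_form_a, levels_form_b))
    return matches / len(levels_form_a)

# Hypothetical classifications for five examinees
form_a = ["Basic", "Proficient", "Advanced", "Proficient", "Below Basic"]
form_b = ["Basic", "Proficient", "Proficient", "Proficient", "Below Basic"]
print(decision_consistency(form_a, form_b))  # 0.8
```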

    Differential Item Functioning

    Differential item functioning is a statistical property of a test item in which different groups of test takers (who have the same total test score) have different average item scores. In other words, students with the same ability level but different group memberships do not have the same probability of answering the item correctly (see also bias).
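The sketch below illustrates the underlying logic with a naive comparison of item p-values for two groups matched on total score; the operational DIF analyses for the Keystone Exams (see Chapter Five) use formal statistical procedures rather than this simplified check, and the group labels here are hypothetical.

```python
from collections import defaultdict

def dif_by_stratum(total_scores, item_scores, groups):
    """For each total-score stratum, compare the item p-value of the
    'ref' (reference) and 'focal' groups. Large, consistent gaps across
    strata suggest possible DIF. Illustrative sketch only."""
    strata = defaultdict(lambda: {"ref": [], "focal": []})
    for total, item, group in zip(total_scores, item_scores, groups):
        strata[total][group].append(item)
    report = {}
    for total, scores in sorted(strata.items()):
        if scores["ref"] and scores["focal"]:
            p_ref = sum(scores["ref"]) / len(scores["ref"])
            p_focal = sum(scores["focal"]) / len(scores["focal"])
            report[total] = p_ref - p_focal
    return report

totals = [10, 10, 12, 12, 12, 15, 15]
item = [1, 0, 1, 1, 0, 1, 1]
groups = ["ref", "focal", "ref", "focal", "focal", "ref", "focal"]
print(dif_by_stratum(totals, item, groups))  # {10: 1.0, 12: 0.5, 15: 0.0}
```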

Distractor

An incorrect option in a multiple-choice item (also called a foil).

    Equating

    The strongest of several linking methods used to establish comparability between scores from multiple tests. Equated test scores should be considered exchangeable. Consequently, the criteria needed to refer to a linkage as equating are strong and somewhat complex (equal construct and precision, equity, and invariance). In practical terms, it is often stated that it should be a matter of indifference to a student if he/she takes any of the equated tests. Also see Linking.

Exact Agreement

Exact agreement indicates that identical scores/ratings are assigned by two different raters under the same conditions (e.g., two independent raters give a paper the same score).
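Both exact and adjacent agreement rates can be sketched as simple proportions over pairs of ratings; the data below are hypothetical.

```python
def agreement_rates(rater1, rater2):
    """Return (exact, adjacent) agreement proportions for two raters."""
    n = len(rater1)
    exact = sum(a == b for a, b in zip(rater1, rater2)) / n
    adjacent = sum(abs(a - b) == 1 for a, b in zip(rater1, rater2)) / n
    return exact, adjacent

scores_r1 = [3, 2, 4, 1, 2, 3]
scores_r2 = [3, 3, 4, 1, 0, 3]
print(agreement_rates(scores_r1, scores_r2))  # exact ~0.67, adjacent ~0.17
```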

    Field-Test (FT) Items

    The Keystone Exams use multiple test forms for each content area test. Each form is composed of operational (OP) items and field-test (FT) items. An FT item is a newly developed item that is ready to be tried out to determine its statistical properties (e.g., see p-value and Point-Biserial Correlation). Each test form includes a set of FT items. FT items are not part of any student scores.

Frequency

Frequency is the number of times that a certain value or range of values (score interval) occurs in a distribution of scores.

    Frequency Distribution

    Frequency distribution is a tabulation of scores from low to high or high to low with the number and/or percent of individuals who obtain each score or who fall within each score interval.

    Infit/Outfit

    Infit and outfit are statistical indicators of the agreement of the data and the measurement model. Infit and outfit are highly correlated, and they both are highly correlated with the point-biserial correlation. Underfit can be caused when low-ability students correctly answer difficult items (perhaps by guessing or atypical experience) or high-ability students incorrectly answer easy items (perhaps because of carelessness or gaps in instruction). Any model expects some level of variability, so overfit can occur when nearly all low-ability students miss an item while nearly all high-ability students get the item correct.
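For dichotomous items under the Rasch model, infit and outfit can be computed from standardized residuals, roughly as sketched below; this is a simplified illustration, not the production calibration code.

```python
import numpy as np

def rasch_fit_statistics(responses, abilities, difficulty):
    """Infit (information-weighted) and outfit (unweighted) mean-square
    fit statistics for one dichotomous item. Illustrative sketch only.

    responses  : 0/1 item scores, one per examinee
    abilities  : Rasch ability estimates (logits)
    difficulty : Rasch item difficulty (logits)
    Values near 1.0 indicate good agreement with the model.
    """
    x = np.asarray(responses, dtype=float)
    theta = np.asarray(abilities, dtype=float)
    p = 1.0 / (1.0 + np.exp(-(theta - difficulty)))  # expected score
    w = p * (1.0 - p)                                # model variance
    z2 = (x - p) ** 2 / w                            # squared standardized residuals
    outfit = z2.mean()
    infit = np.sum(w * z2) / np.sum(w)
    return infit, outfit
```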


    Item Difficulty

For the Rasch model, the dichotomous item difficulty represents the point along the latent trait continuum where an examinee has a 0.50 probability of making a correct response. For a constructed-response item, the difficulty is the average of the item's step difficulties (see also step difficulty).
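A brief numerical illustration of this definition, using the standard dichotomous Rasch response function (a sketch with hypothetical values):

```python
import math

def rasch_probability(ability, difficulty):
    """P(correct) under the dichotomous Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals the item difficulty, the probability is exactly 0.50.
print(rasch_probability(ability=0.3, difficulty=0.3))  # 0.5
print(rasch_probability(ability=1.3, difficulty=0.3))  # ~0.73
```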

Key

The key is the correct response option or answer to a test item.

    Linking

    Linking is a generic term referring to one of a number of processes by which scores from one or more tests are made comparable to some degree. Linking includes several classes of transformations (equating, scale alignment, prediction, etc.). Equating is associated with the strongest degree of comparability (exchangeable scores). Other linkages may be very strong, but fail to meet one or more of the strict criteria required of equating (see also equating).

    Logit

In Rasch scaling, logits are units used to express both examinee ability and item difficulty. When expressing examinee ability, a student who answers more items correctly has a higher logit than a student who answers fewer items correctly. Logits are transformed into scaled scores through a linear transformation. When expressing item difficulty, logits are a transformation of the item's p-values (see also p-value). The logit difficulty scale is inversely related to p-values: a higher logit value represents a relatively harder item, while a lower logit value represents a relatively easier item.
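The inverse relationship between p-values and logit difficulties can be sketched with a simple log-odds transformation; this is illustrative only, since operational item difficulties come from Rasch calibration rather than this direct conversion.

```python
import math

def pvalue_to_logit(p):
    """Log-odds of an incorrect response: higher values = harder item.
    Illustrative only; operational difficulties come from Rasch calibration."""
    return math.log((1.0 - p) / p)

for p in (0.9, 0.5, 0.2):
    print(p, round(pvalue_to_logit(p), 2))
# 0.9 -> -2.2 (easy), 0.5 -> 0.0, 0.2 -> 1.39 (hard)
```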

    Mean

    Mean is also referred to as the arithmetic mean of a set of scores. It is found by adding all the score values in a distribution and dividing by the total number of scores. For example, the mean of the set {66, 76, 85, and 97} is 81. The value of a mean can be influenced by extreme values in a score distribution.

    Measure

In Rasch scaling, measure generally refers to a specific estimate of an examinee's ability (often expressed in logits) or an item's difficulty (again, often expressed in logits). As an example for the Keystone Exams, a student's literature measure might be equal to 0.525 logits, or a Keystone Literature test item might have a difficulty measure equal to -0.905 logits.

    Median

    The median is the middle point or score in a set of rank-ordered observations that divides the distribution into two equal parts; each part contains 50 percent of the total data set. More simply put, half of the scores are below the median value and half of the scores are above the median value. As an example, the median for the following ranked set of scores {2, 3, 6, 8, 9} is 6.

    Module

On score reports, a module often refers to a set of items on a test measuring the same content area (e.g., Operations and Linear Equations & Inequalities in Algebra I). Items developed to measure the same reporting category would be used to determine the module score (sometimes called a subscale score).

    Multiple-Choice Item

    Multiple-choice item is a type of item format that requires the test taker to select a response from a group of possible choices, one of which is the correct answer (or key) to the question posed (see also constructed-response item).

    N-count

Sometimes designated as N or n, it is the number of observations (usually individuals or students) in a particular group. Some examples include the number of students tested, the number of students tested from a specific subpopulation (e.g., females), and the number of students who attained a specific score. In the following set {23, 32, 56, 65, 78, 87}, n = 6.


    Operational Item

    The Keystone Exams use multiple test forms for each content area test. Each form is composed of operational (OP) items and field-test (FT) items. OP items are the same on all forms for any content area test. Student total raw scores and scaled scores are based exclusively on the OP items.

    Percent Correct

When referring to an individual item, the percent correct is the item's p-value expressed as a percent (instead of a proportion). When referring to a total test score, it is the percentage of the total number of points that a student received. The percent correct score is obtained by dividing the student's raw score by the total number of points possible and multiplying the result by 100. Percent correct scores are often used in criterion-referenced interpretations and are generally more helpful if the overall difficulty of a test is known. Sometimes percent correct scores are incorrectly interpreted as percentile ranks.

    Percentile

    Percentile is the score or point in a score distribution at or below which a given percentage of scores fall. It should be emphasized that it is a value on the score scale, not the associated percentage (although sometimes in casual usage this misinterpretation is made). For example, if 72 percent of the students score at or below a scaled score of 1500 on a given test, then the scaled score of 1500 would be considered the 72nd percentile. As another example, the median is the 50th percentile.

    Percentile Rank

The percentile rank is the percentage of scores in a specified distribution that fall at or below a certain point on a score distribution. Percentile ranks range in value from 1 to 99. They indicate the status or relative standing of an individual within a specified group by indicating the percent of individuals in that group who obtained equal or lower scores. An individual's percentile rank can vary depending on which group is used to determine the ranking. As suggested above, percentiles and percentile ranks are sometimes used interchangeably; however, strictly speaking, a percentile is a value on the score scale.
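A small sketch of the distinction, using hypothetical scaled scores: the percentile is a point on the score scale, while the percentile rank is the percentage of scores at or below that point.

```python
def percentile_rank(scores, value):
    """Percentage of scores at or below `value` (illustrative definition;
    operational percentile ranks are typically capped at 1-99)."""
    at_or_below = sum(s <= value for s in scores)
    return 100.0 * at_or_below / len(scores)

scaled_scores = [1320, 1400, 1450, 1500, 1500, 1530, 1580, 1600, 1650, 1700]
print(percentile_rank(scaled_scores, 1500))  # 50.0 -> 1500 is the 50th percentile here
```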

    Performance Level Descriptors

Performance level descriptors are descriptions of an individual's competency in a particular content area, usually defined as ordered categories (e.g., from Below Basic to Advanced) on a continuum. The exact labels of these categories and their narrative descriptions may vary from one assessment or testing program to another.

    Point-Biserial Correlation

In classical test theory, point-biserial correlation is an item discrimination index. It is the correlation between a dichotomously scored item and a continuous criterion, usually represented by the total test score (or the corrected total test score with the reference item removed). It reflects the extent to which an item differentiates between high-scoring and low-scoring examinees. This discrimination index ranges from -1.00 to +1.00. The higher the discrimination index (the closer to +1.00), the better the item is considered to be performing. For multiple-choice items scored as 0 or 1, it is rare for the value of this index to exceed 0.5.
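Because the point-biserial is simply the Pearson correlation between a 0/1 item score and the (often corrected) total score, it can be sketched as follows with hypothetical data:

```python
import numpy as np

def point_biserial(item_scores, total_scores, corrected=True):
    """Correlation between a dichotomous item and the total score.
    If corrected, the item's own points are removed from the total."""
    item = np.asarray(item_scores, dtype=float)
    total = np.asarray(total_scores, dtype=float)
    criterion = total - item if corrected else total
    return np.corrcoef(item, criterion)[0, 1]

item = [1, 0, 1, 1, 0, 1, 0, 1]
total = [38, 17, 30, 25, 12, 33, 20, 28]
print(round(point_biserial(item, total), 3))
```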

    p-value

P-value is an index indicating an item's difficulty for some specified group (perhaps a grade level). It is calculated as the proportion (sometimes percent) of students in the group who answer the item correctly. P-values range from 0.0 to 1.0 on the proportion scale. Lower values correspond to more difficult items and higher values correspond to easier items. P-values are usually provided for multiple-choice items or other items worth one point. For open-ended items or items worth more than one point, difficulty on a p-value-like scale can be estimated by dividing the item mean score by the maximum number of points possible for the item (see also Logit).
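A minimal sketch of both calculations, using hypothetical item data:

```python
def pvalue_dichotomous(scores):
    """Proportion of examinees answering a one-point item correctly."""
    return sum(scores) / len(scores)

def pvalue_like_polytomous(scores, max_points):
    """Item mean divided by the maximum possible points, giving a
    difficulty index on the same 0.0-1.0 scale as a p-value."""
    return (sum(scores) / len(scores)) / max_points

print(pvalue_dichotomous([1, 0, 1, 1, 0]))         # 0.6
print(pvalue_like_polytomous([4, 2, 3, 1, 2], 4))  # 0.6
```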


    Raw Score

    Raw score (RS) is an unadjusted score usually determined by tallying the number of questions answered correctly or by the sum of item scores (i.e., points). (Some rarer situations might include formula-scoring, the amount of time required to perform a task, the number of errors, application of basal/ceiling rules, etc.). Raw scores typically have little or no meaning by themselves and require additional information like the number of items on the test, the difficulty of the test items, norm-referenced information, or criterion-referenced information.

    Reliability

    Reliability is the expected degree to which test scores for a group of examinees are consistent over exchangeable replications of an assessment procedure, and therefore, considered dependable and repeatable for an individual examinee. A test that produces highly consistent, stable results (i.e., relatively free from random error) is said to be highly reliable. The reliability of a test is typically expressed as a reliability coefficient or by the standard error of measurement derived by that coefficient.

    Reliability Coefficient

Reliability coefficient is a statistical index that reflects the degree to which scores are free from random measurement error. Theoretically, it expresses the consistency of test scores as the ratio of true score variance to total score variance (true score variance plus error variance). This statistic is often expressed as a correlation coefficient (e.g., the correlation between two forms of a test) or with an index that resembles a correlation coefficient (e.g., a calculation of a test's internal consistency using coefficient alpha). Expressed this way, the reliability coefficient is a unitless index. The higher the value of the index (closer to 1.0), the greater the reliability of the test (see also standard error of measurement).
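Coefficient alpha (discussed in Chapter Eighteen) can be computed from an examinee-by-item score matrix; a minimal sketch with a hypothetical toy data set:

```python
import numpy as np

def coefficient_alpha(score_matrix):
    """Cronbach's coefficient alpha for an examinee-by-item score matrix.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    scores = np.asarray(score_matrix, dtype=float)
    k = scores.shape[1]                      # number of items
    item_vars = scores.var(axis=0, ddof=1)   # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 6 examinees x 4 items
data = [[1, 1, 1, 2], [0, 1, 0, 1], [1, 0, 1, 1],
        [0, 0, 0, 0], [1, 1, 1, 2], [1, 0, 0, 1]]
print(round(coefficient_alpha(data), 3))  # roughly 0.83 for this toy data
```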

    Scaled Score

    Scaled score is a mathematical transformation of a raw score developed through a process called scaling. Scaled scores are most useful when comparing test results over time. Several different methods of scaling exist, but each is intended to provide a continuous and meaningful score scale across different forms of a test.

    Spiraling

    Spiraling is a packaging process used when multiple forms of a test exist and it is desired that each form be tested in all classrooms (or other grouping unit such as a school) participating in the testing process. This process allows for the random distribution of test booklets to students. For example, if a package has four test forms labeled A, B, C, & D, the order of the test booklets in the package would be: A, B, C, D, A, B, C, D, A, B, C, D, etc.
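The packaging order described above can be sketched as a simple round-robin cycle over the available forms (illustrative only):

```python
from itertools import cycle, islice

def spiral_forms(forms, n_booklets):
    """Return the spiraled packaging order for n_booklets test booklets."""
    return list(islice(cycle(forms), n_booklets))

print(spiral_forms(["A", "B", "C", "D"], 10))
# ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D', 'A', 'B']
```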

    Standard Deviation

Standard deviation (SD) is a statistic that measures the degree of spread or dispersion of a set of scores. The value of this statistic is always greater than or equal to zero. If all of the scores in a distribution are identical, the standard deviation is equal to zero. The further the scores are away from one another in value, the greater the standard deviation. This statistic is calculated using the information about the deviations (distances) between each score and the distribution's mean. It is equivalent to the square root of the variance statistic. The standard deviation is a commonly used method of examining a distribution's variability since the standard deviation is expressed in the same units as the data.


    Standard Error of Measurement

Standard error of measurement (SEM) is the amount an observed score is expected to fluctuate around the true score. As an example, across replications of a measurement procedure, the true score will not differ by more than plus or minus one standard error from the observed score about 68 percent of the time (assuming normally distributed errors). The SEM is frequently used to obtain an idea of the consistency of a person's score in actual score units, or to set a confidence band around a score in terms of the error of measurement. Often a single SEM value is calculated for all test scores. On other occasions, however, the value of the SEM can vary along a score scale. Conditional standard errors of measurement (CSEM) provide an SEM for each possible scaled score.
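The relationship between the SEM, score variability, and reliability can be sketched with the classical formula SEM = SD * sqrt(1 - reliability); the values below are illustrative, not Keystone results.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical SEM derived from the score SD and a reliability coefficient."""
    return sd * math.sqrt(1.0 - reliability)

sem = standard_error_of_measurement(sd=100.0, reliability=0.91)
observed = 1520.0
# Approximate 68 percent confidence band around the observed score
print(sem, (observed - sem, observed + sem))  # 30.0, (1490.0, 1550.0)
```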

    Standard Setting

Also referred to as performance level setting, standard setting is a procedure used in the determination of the cut scores for a given assessment. It is used to measure students' progress toward certain performance standards. Standard setting methods vary (e.g., modified Angoff, Bookmark Method), but most use a panel of educators and expert judgments to operationalize the level of achievement students must demonstrate in order to be categorized within each performance level.

    Step Difficulty

Step difficulty is a parameter estimate in Masters' partial credit model (PCM) that represents the relative difficulty of each score step (e.g., going from a score of 1 to a score of 2). The higher the value of a particular step difficulty, the more difficult that step is relative to the other score steps.
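The role of step difficulties can be sketched with the PCM category probability formula; this is a simplified illustration with hypothetical parameter values, not production calibration code.

```python
import math

def pcm_category_probabilities(theta, step_difficulties):
    """Category probabilities for one item under Masters' partial credit
    model. step_difficulties[k] is the difficulty of moving from score k
    to score k+1 (logits). Illustrative sketch only."""
    # Cumulative sums of (theta - step difficulty); score 0 contributes 0.
    cumulative = [0.0]
    for delta in step_difficulties:
        cumulative.append(cumulative[-1] + (theta - delta))
    numerators = [math.exp(c) for c in cumulative]
    denominator = sum(numerators)
    return [n / denominator for n in numerators]

# A 0-3 point item in which the last step (2 -> 3) is the hardest.
print(pcm_category_probabilities(theta=0.0, step_difficulties=[-1.0, 0.2, 1.5]))
```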

    Technical Advisory Committee

The technical advisory committee (TAC) is a group of individuals (most often professionals in the field of testing) who are appointed or selected to make recommendations for and to guide the technical development of a given testing program.

Validity

Validity is the degree to which accumulated evidence and theory support specific interpretations of test scores entailed by the purpose of a test. There are various ways of gathering validity evidence.


    PREFACE: AN OVERVIEW OF THE ASSESSMENTS

    THE KEYSTONE EXAMS FROM 2008 TO PRESENT

    COMPREHENSIVE GRADUATION COMPETENCY ASSESSMENT PROGRAM

    In 2008, the Commonwealth of Pennsylvania initiated a comprehensive graduation competency assessment program. The goals of this program include the following:

    To provide for a system that is aligned, focused, standards-based, accurate, universally applicable, and publicly accessible

To develop, produce, distribute, administer (both online and paper/pencil), collect, score, analyze, track, and report results of graduation competency assessments for ten high-school-level content areas: Algebra I, Algebra II, Biology, Chemistry, Civics and Government, English Composition, Geometry, Literature, U.S. History, and World History, with each area or course composed of modules containing unique content

To provide graduation competency testing opportunities for students three times each school year (spring, summer, and fall), with students permitted to retake modules until proficiency is achieved on each module

    To report graduation competency results under accelerated timelines

    To ensure validity and reliability of the assessment systems through technically sound test development and psychometric practices, detailed statistical analyses and research studies, and well-documented processes and quality procedures

The Keystone Exams, as the graduation competency assessments are named, are just one component of Pennsylvania's new system of high school graduation requirements. Keystone Exams are designed to help school districts guide students toward meeting state standards that are aligned with expectations for success in college and the workplace. In order to receive a diploma, students must also meet local district credit and attendance requirements and complete a culminating project, along with any additional district requirements.

For graduating classes, students must demonstrate successful completion of secondary-level coursework in Algebra I, Biology, English Composition, and Literature, for which the Keystone Exam serves as the final course exam. Students' Keystone Exam scores will count for at least one-third of the final course grades.

    Based upon Chapter 4 regulations, each Keystone Exam is designed in modules that reflect distinct, related academic content common to the traditional progression of coursework. Students who do not score Proficient or above on a Keystone Exam module may choose to complete a project-based assessment for that module based upon other specific requirements.


ASSESSMENT ACTIVITIES OCCURRING IN THE 2010–2011 SCHOOL YEAR

The first assessment activities took place in the 2010–2011 school year. Prior to November 2010, there were no Keystone Exams assessment events. The table below outlines the field tests and operational exams administered during the 2010–11 school year.

Following the development of Assessment Anchors and Eligible Content, exams were developed for initial field test in 2010 and were subsequently administered as operational exams in 2011. Additional exams, based on the Assessment Anchors and Eligible Content developed in 2009 and 2010, were developed for initial field test in 2011. Detailed information about the operational exam activities that occurred during the 2010–2011 school year is in the Keystone Exams Spring 2011 Algebra I, Biology, and Literature Technical Report.

Field Test and Operational Exams during the 2010–11 School Year

Exam                  Assessment Activity                          Date
Algebra I             Initial Standalone Field Test                Fall 2010 (November)
Algebra I             Inaugural Operational Exam Administration    Spring 2011 (May)
Algebra II            Initial Standalone Field Test                Spring 2011 (May)
Biology               Initial Standalone Field Test                Fall 2010 (November)
Biology               Inaugural Operational Exam Administration    Spring 2011 (May)
English Composition   Initial Standalone Field Test                Spring 2011 (May)
Geometry              Initial Standalone Field Test                Spring 2011 (May)
Literature            Initial Standalone Field Test                Fall 2010 (November)
Literature            Inaugural Operational Exam Administration    Spring 2011 (May)


    CHAPTER ONE: BACKGROUND OF THE KEYSTONE EXAMS

This brief overview of the Pennsylvania Keystone Exams summarizes the history of the program's development process, intent and purpose, and recent changes.

    ASSESSMENT HISTORY IN PENNSYLVANIA

Pennsylvania's involvement in statewide assessment began in the 1969–70 school year with a purely school-based assessment known as Educational Quality Assessment (EQA), which continued through the 1987–1988 school year. A state-mandated student competency testing program called Testing for Essential Learning and Literacy Skills (TELLS) also operated from the 1984–85 through 1990–91 school years. In 1990, the state also initiated an on-demand writing assessment.

The Pennsylvania System of School Assessment (PSSA) program was instituted in 1992 as a school evaluation model with reporting at the school level only. The PSSA initially measured performance in the content areas of mathematics and reading at Grades 5, 8, and 11, and in writing at Grades 6 and 9. Starting in 1994, as part of the Chapter 5 Regulations, the PSSA added student-level reports. In 1999, as part of Chapter 4 Regulations, the State Board of Education adopted the Pennsylvania Academic Standards for Mathematics and for Reading, Writing, Speaking, and Listening. Proficiency levels for Advanced, Proficient, Basic, and Below Basic were defined in 2000. In 2001 and 2004, the reading and mathematics assessments underwent various content enhancements to improve alignment to the 1999 Academic Standards. Grade 11 was added to the writing assessment in 2001. Then, in 2004–2005, the PSSA Assessment Anchors and Eligible Content were developed to clarify content structure and improve articulation between assessment and instruction. In addition, in 2005, the Grade 6 and 9 writing assessments were moved to Grades 5 and 8. By 2006, the operational mathematics and reading assessments incorporated Grades 3 through 8 and 11. In 2007, the PSSA and the PSSA Assessment Anchors and Eligible Content underwent additional content enhancements. In 2008, science was added to the PSSA as an operational assessment. The PSSA has continued in this configuration since 2008.

    THE KEYSTONE EXAMS

In 2008, the Commonwealth of Pennsylvania initiated a comprehensive graduation competency assessment program. As a key piece of this program, the Keystone Exams are designed to assess proficiency in various subject areas, including Algebra I, Algebra II, Biology, Chemistry, Civics and Government, English Composition, Geometry, Literature, U.S. History, and World History. The Keystone Exams are just one component of Pennsylvania's high school graduation requirements. Students must also earn state-specified credits, fulfill the state's service-learning and attendance requirements, and complete any additional local school system requirements to receive a Pennsylvania high school diploma.

    The stated goals of the Keystone program are to

provide for a system that is aligned, focused, standards-based, accurate, universally applicable, and publicly accessible.

    develop, produce, distribute, administer (both online and paper/pencil), collect, score, analyze, track, and report results of graduation competency assessments for ten high-school-level content areas: Algebra I, Algebra II, Biology, Chemistry, Civics and Government, English Composition, Geometry, Literature, U.S. History, and World History, with each area or course composed of modules containing unique content.


provide graduation competency testing opportunities for students three times each school year (spring, summer, and fall), with students permitted to retake modules until proficiency is achieved on each module.

    report graduation competency results under accelerated timelines.

    ensure validity and reliability of the assessment systems through technically sound test development and psychometric practices, detailed statistical analyses and research studies, and well-documented processes and quality procedures.

    GRADUATION REQUIREMENTS AND THE KEYSTONE EXAMS

    Based upon Chapter 4 regulations, each Keystone Exam is designed in modules that reflect distinct, related academic content common to the traditional progression of coursework. Students who do not score Proficient or above on a Keystone Exam module may choose to complete a project-based assessment for that module based upon the requirements as detailed below.

If a student is unable to meet the requirements in 4.24(b)(1)(iv)(A) (relating to high school graduation requirements) after two attempts on a Keystone Exam, the student may supplement a Keystone Exam score through satisfactory completion of a project-based assessment. Points earned through satisfactory performance on one or more project modules related to the Keystone Exam module or modules that the student did not pass shall be added to the student's highest Keystone Exam score.

Students may qualify to participate in one or more project-based assessments if the student has met all of the following conditions (an illustrative sketch of these qualification rules follows the list):

    1. The student has taken the course.

    2. The student was unsuccessful in achieving a score of Proficient on the Keystone Exam after at least two attempts.

3. The student has met the district's attendance requirements for the course.

    4. The student has participated in a satisfactory manner in supplemental instructional services under 4.24(i).
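
As a purely illustrative aid, the sketch below expresses these four qualification conditions as a single check. The field names and data structure are hypothetical and are not part of any PDE or DRC system; the actual eligibility determination is made by districts under Chapter 4 regulations.

```python
# Hypothetical sketch of the project-based assessment qualification rules
# listed above. All field names are illustrative only.

from dataclasses import dataclass

@dataclass
class StudentModuleRecord:
    took_course: bool                 # condition 1: student has taken the course
    exam_attempts: int                # number of attempts on the Keystone Exam module
    best_performance_level: str       # e.g., "Below Basic", "Basic", "Proficient", "Advanced"
    met_attendance_requirement: bool  # condition 3: district attendance requirement met
    satisfactory_supplemental_instruction: bool  # condition 4: services under 4.24(i)

def qualifies_for_project_based_assessment(record: StudentModuleRecord) -> bool:
    """Return True only if all four conditions listed above are met."""
    not_yet_proficient = record.best_performance_level not in ("Proficient", "Advanced")
    return (
        record.took_course
        and record.exam_attempts >= 2
        and not_yet_proficient
        and record.met_attendance_requirement
        and record.satisfactory_supplemental_instruction
    )

# Example: a student with two unsuccessful attempts who met the other conditions
example = StudentModuleRecord(True, 2, "Basic", True, True)
print(qualifies_for_project_based_assessment(example))  # True
```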

    KEYSTONE ASSESSMENT ANCHORS AND ELIGIBLE CONTENT

In 2009, the state initiated development of test designs and test blueprints for the Keystone Exams based on the Pennsylvania Keystone Course Standards and the Common Core State Standards. Committees of Pennsylvania educators met in 2009, 2010, and 2011 to write, review, and approve Assessment Anchors and Eligible Content statements and sample exam items. To provide initial focus, each test blueprint committee was presented materials specific to the exam in question, including a basic blueprint structure, the Pennsylvania State Standards, the Common Core State Standards, and draft eligible content statements based on the Standards and the Common Core State Standards. The results of the initial committee work were evaluated by national, state, and local subject experts, and, following revisions, the statements were ultimately validated by another committee of Pennsylvania educators. Following committee approval, the Keystone Assessment Anchors and Eligible Content statements for literacy, mathematics, and science were approved by the State Board of Education in September 2010.


    Mathematics

    The first committee meetings took place in April 2009, where initial drafts of the test blueprints were developed for Algebra I, Algebra II, and Geometry.

    A follow-up committee meeting for the three mathematics exams was held in August 2009.

    Literacy

    The first committee meetings took place in April 2009, where initial drafts of the test blueprints were developed for English Composition and Literature.

    A follow-up committee meeting for the two literacy exams was held in November 2009.

    Science

    The first committee meetings took place in October 2009, where the initial draft of the test blueprint was developed for Biology.

    A follow-up committee meeting for Biology was held in January 2010.

    In addition, in January 2010, the initial draft of the test blueprint was developed for Chemistry.

    Chemistry was part of a follow-up committee meeting held in late January 2010.

    Social Studies

    The first committee meetings took place in November 2010, where initial drafts of the test blueprints were developed for Civics and Government, U.S. History, and World History.

    A follow-up committee meeting for the Civics and Government exam was held in October 2011.

    A follow-up committee meeting for U.S. History and World History remains unscheduled pending further decisions about the future of those Keystone exams.

    WAVE IMPLEMENTATION OF THE EXAMS

The implementation plan for the Keystone Exams envisioned the ten Keystone Exams becoming operational during a series of four distinct waves. The initial wave included Algebra I, Biology, and Literature. These first three exams were field tested in fall 2010 and reached operational status with the spring 2011 administration. The second wave included Algebra II, English Composition, and Geometry. These next three exams were field tested in spring 2011 and are now planned to reach operational status with the spring 2013 administration. The third wave includes only Civics and Government; this seventh exam is scheduled for initial field testing in spring 2013, with operational status to be reached with the spring 2014 administration. The implementation of the final wave, which includes the Chemistry, U.S. History, and World History exams, is currently unscheduled.


Table 1-1. Keystone Exams Wave Implementation Plan

Wave | Exam(s) | Initial Field Test | First Operational
1 | Algebra I, Biology, Literature | fall 2010 | spring 2011
2 | Algebra II, English Composition, Geometry | spring 2011 | spring 2013*
3 | Civics and Government | spring 2013* | spring 2014*
4 | Chemistry, U.S. History, World History | TBD | TBD

*Projected

    MODE OF DELIVERY FOR THE EXAMS

    One key feature of the Keystone Exams is the dual mode of delivery of the testing materials that is available to districts. In addition to the traditional paper-and-pencil format, the Keystone Exams are also available in a computer-based online format using test-delivery software.

While exam materials are still available in the traditional format (two pieces of exam materials: a test book and a separate answer book [or, in the case of English Composition, a single combined test/answer book]), districts are given the option to administer the exams using computer-based online testing software instead of the paper/pencil format.

    For more information about how the online exams were developed in concert with the traditional paper-and-pencil format, see Chapter Three.

    For more information about the effect of the mode on item performance, see Chapter Twenty.

    MULTIPLE TESTING OPPORTUNITIES

Another key feature of the Keystone Exams is the multiple testing opportunities provided to students. Main administrations in both spring and winter provide options for students completing coursework at various times of the year and accommodate both traditional and block scheduling. A summer retest opportunity is also available. The first winter administration is scheduled for winter 2013, and the first summer administration is scheduled for summer 2013.

    PERFORMANCE LEVELS FOR THE KEYSTONE EXAMS

    The State Board approved a set of criteria defining Advanced, Proficient, Basic, and Below Basic levels of performance for the Keystone Exams. More information about these Performance Level Descriptors (PLDs) is found in Chapter Thirteen.

    OPERATIONAL TEST DESIGN INFORMATION

The test definition of each of the operational Keystone Exams, including information about exam-specific test designs, test blueprints, test layouts, item types, and other exam elements, is detailed in Chapter Two.


    CHAPTER TWO: TEST DEVELOPMENT OVERVIEW OF THE KEYSTONE EXAMS

    KEYSTONE BLUEPRINT/ASSESSMENT ANCHORS AND ELIGIBLE CONTENT

The Keystone Test Blueprints, known as the Keystone Exams Assessment Anchors and Eligible Content, are based on the Pennsylvania Keystone Course Standards and the Common Core State Standards. Prior to the development of the Assessment Anchors, multiple groups of Pennsylvania educators convened to create a set of standards for each of the Keystone Exams. Derived from a review of existing standards, these Enhanced Standards (Course Standards) focus on what students need to know and be able to do in order to be ready for college and career.

Although the Keystone Course Standards indicate what students should know and be able to do, the Assessment Anchors are designed to indicate which parts of the Keystone Course Standards (Instructional Standards) will be assessed on the Keystone Exams. Based on recommendations from Pennsylvania educators, the Assessment Anchors were designed as a tool to improve the articulation of curricular, instructional, and assessment practices. The Assessment Anchors clarify what is expected and focus the content of the standards into what is assessable on a large-scale exam. The Assessment Anchor documents also serve to communicate the Eligible Content: the range of knowledge and skills from which the Keystone Exams are designed.

    The Keystone Exams Assessment Anchors and Eligible Content have been designed to hold together or anchor the state assessment system and curriculum/instructional practices in schools by following these design parameters:

    Clear: The Assessment Anchors are easy to read and are user friendly; they clearly detail which standards are assessed on the Keystone Exams.

    Focused: The Assessment Anchors identify a core set of standards that can be reasonably assessed on a large-scale assessment; this keeps educators from having to guess which standards are critical.

    Rigorous: The Assessment Anchors support the rigor of the state standards by assessing higher order and reasoning skills.

    Manageable: The Assessment Anchors define the standards in a way that can be easily incorporated into a course to prepare students for success.

    The Assessment Anchors and Eligible Content are organized into cohesive blueprints, each structured with a common labeling system. This framework is organized by increasing levels of detail: first, by Module (Reporting Category); second, Assessment Anchor; third, Anchor Descriptor; fourth, Eligible Content statement. The common format of this outline is followed across the Keystone Exams.


    Here is a description of each level in the labeling system for the Keystone Exams:

    Module: The Assessment Anchors are organized into two thematic modules for each of the Keystone Exams, and these modules serve as the Reporting Categories for the Keystone Exams. The module title appears at the top of each page in the Assessment Anchor document. The module level is also important because the Keystone Exams are built using a module format, with each of the Keystone Exams divided into two equally sized test modules. Each module is made up of two or more Assessment Anchors.

    Assessment Anchor: The Assessment Anchor appears in the shaded bar across the top of each Assessment Anchor table in the Assessment Anchor document. The Assessment Anchors represent categories of subject matter that anchor the content of the Keystone Exams. Each Assessment Anchor is part of a module and has one or more Anchor Descriptors unified under it.

    Anchor Descriptor: Below each Assessment Anchor in the Assessment Anchor document is a specific Anchor Descriptor. The Anchor Descriptor level details the scope of content covered by the Assessment Anchor. Each Anchor Descriptor is part of an Assessment Anchor and has one or more Eligible Content unified under it.

    Eligible Content: The column to the right of the Anchor Descriptor in the Assessment Anchor document contains the Eligible Content statements. The Eligible Content is the most specific description of the content that is assessed on the Keystone Exams. This level is considered the assessment limit. It helps educators identify the range of content covered on the Keystone Exams.

    Enhanced Standard: In the column to the right of each Eligible Content statement is a code representing one or more Enhanced Standards that correlate to the Eligible Content statement. Some Eligible Content statements include annotations that clarify the scope of an Eligible Content.

    Notes: There are three types of notes included in the Assessment Anchor document.

e.g. (for example): sample approach, but not a limit to the Eligible Content

i.e. (that is): specific limit to the Eligible Content

Note: content exclusions or definable range of the Eligible Content

The Assessment Anchor coding is read like an outline. The coding includes the Subject (Exam), Reporting Category/Module, Assessment Anchor, Anchor Descriptor, and Eligible Content. Each exam has two modules. Each module has two or more Assessment Anchors. Each of the Assessment Anchors has one or more Anchor Descriptors, and each Anchor Descriptor has at least one Eligible Content (generally more than one). The Assessment Anchors form the basis of the test design for the exams undergoing test development. In turn, this hierarchy is the basis for organizing the total module and exam scores (based on the core [common] portions).


Table 2-1. Sample Keystone Assessment Anchor Coding

Sample Code: A1.1.1.2.1 (A1 | 1 | 1 | 2 | 1)
  Subject (Exam): Algebra I
  Reporting Category (Module): Operations and Linear Equations & Inequalities
  Assessment Anchor (AA): Linear Equations
  Anchor Descriptor (AD): Write, solve, and/or graph linear equations using various methods.
  Eligible Content (EC): Write, solve, and/or apply a linear equation (including problem situations).

Sample Code: BIO.A.2.1.1 (BIO | A | 2 | 1 | 1)
  Subject (Exam): Biology
  Reporting Category (Module): Cells and Cell Processes
  Assessment Anchor (AA): The Chemical Basis for Life
  Anchor Descriptor (AD): Describe how the unique properties of water support life on Earth.
  Eligible Content (EC): Describe the unique properties of water and how these properties support life on Earth (e.g., freezing point, high specific heat, cohesion).

Sample Code: L.F.2.4.1 (L | F | 2 | 4 | 1)
  Subject (Exam): Literature
  Reporting Category (Module): Fiction
  Assessment Anchor (AA): Analyzing and Interpreting Literature (Fiction)
  Anchor Descriptor (AD): Use appropriate strategies to interpret and analyze the universal significance of literary fiction.
  Eligible Content (EC): Interpret and analyze works from a variety of genres for literary, historical, and/or cultural significance.

The complete set of Assessment Anchors and Eligible Content can be referenced at PDE's Standards Aligned System (SAS) website: www.pdesas.org/Standard/KeystoneDownloads.
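
To make the outline-style coding concrete, the following minimal sketch splits a code such as those in Table 2-1 into the five levels described above. This is a hypothetical illustration, not part of the Keystone program's tooling; the dot-delimited convention is inferred from the sample codes shown in the table.

```python
# Hypothetical sketch: split a Keystone Assessment Anchor code into the
# five levels described above (Subject, Module, AA, AD, EC).

from typing import NamedTuple

class AnchorCode(NamedTuple):
    subject: str            # exam, e.g., "A1", "BIO", "L"
    module: str             # reporting category (module)
    assessment_anchor: str
    anchor_descriptor: str
    eligible_content: str

def parse_anchor_code(code: str) -> AnchorCode:
    """Parse a dot-delimited code such as 'BIO.A.2.1.1' into its five levels."""
    parts = code.split(".")
    if len(parts) != 5:
        raise ValueError(f"Expected five dot-delimited levels, got: {code!r}")
    return AnchorCode(*parts)

# Example usage with the sample codes from Table 2-1:
for sample in ("A1.1.1.2.1", "BIO.A.2.1.1", "L.F.2.4.1"):
    print(parse_anchor_code(sample))
```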

    HIGH-LEVEL TEST DESIGN CONSIDERATIONS

The Keystone Exams employ two types of test items (questions): multiple choice and constructed response. These item types assess different levels of knowledge and provide different information about achievement. Psychometrically, multiple-choice items are very useful and efficient tools for collecting information about a student's academic achievement. Constructed-response performance tasks generally generate fewer scorable points than multiple-choice items in the same amount of testing time; however, they provide tasks that are more realistic and sample the eligible content that best lends itself to this item type. Furthermore, well-constructed scoring guides have made it possible to include constructed-response tasks in large-scale assessments, and trained scorers apply the scoring guides to score large numbers of student responses efficiently and reliably. The design of the Keystone Exams attempts to achieve a reasonable balance between the two item types.


Table 2-2. Keystone Exams High-Level Design Considerations

Exam | MC as Percent of Core Points | CR as Percent of Core Points | Points per MC | Points per CR | Number of Modules | Number of Assessment Anchors | Number of Eligible Content
Algebra I | 60 | 40 | 1 | 4 | 2 | 6 | 33
Biology | 73 | 27 | 1 | 3 | 2 | 8 | 38
Literature | 65 | 35 | 1 | 3 | 2 | 4 | 56
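
As a worked illustration of how the percentages in Table 2-2 relate to item counts, the sketch below computes the multiple-choice and constructed-response shares of core points from hypothetical per-module item counts. The counts used are assumptions chosen only to reproduce the Algebra I split; they are not taken from this report's operational form compositions.

```python
# Hypothetical illustration of the core-point arithmetic behind Table 2-2.
# Item counts below are assumptions, not the operational form composition.

def core_point_split(n_mc: int, n_cr: int, points_per_mc: int, points_per_cr: int):
    """Return (MC percent of core points, CR percent of core points)."""
    mc_points = n_mc * points_per_mc
    cr_points = n_cr * points_per_cr
    total = mc_points + cr_points
    return 100 * mc_points / total, 100 * cr_points / total

# Example: an assumed 18 MC items (1 point each) and 3 CR items (4 points each)
mc_pct, cr_pct = core_point_split(n_mc=18, n_cr=3, points_per_mc=1, points_per_cr=4)
print(f"MC: {mc_pct:.0f}%  CR: {cr_pct:.0f}%")   # MC: 60%  CR: 40%
```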

    DEPTH OF KNOWLEDGE

The goal for each Keystone Exam is for each item to be of sufficient rigor, targeting Webb's Depth of Knowledge Level 3. Webb's Depth of Knowledge (DOK) was created by Norman Webb of the Wisconsin Center for Education Research. Webb defines depth of knowledge as the degree or complexity of knowledge that the content curriculum standards and expectations require. Therefore, when reviewing items for depth of knowledge, each item is reviewed to determine whether it is as demanding cognitively as what the content curriculum standard expects. In the case of the Pennsylvania Keystone items, an item meets the criterion if its depth of knowledge is in alignment with the depth of knowledge of the Assessment Anchor as defined by the Eligible Content. Webb's DOK includes four levels, from the lowest (basic recall) to the highest (extended thinking).

    In some specific cases, DOK Level 2 was allowed when the cognitive intent of an Eligible Content was Level 2. For more information on DOK, see Chapter Three and Appendix A.
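
A compact way to express this review rule is as a comparison of an item's assigned DOK level against the cognitive level implied by its Eligible Content. The sketch below is a hypothetical paraphrase of that criterion, not DRC's actual item-review procedure.

```python
# Hypothetical paraphrase of the DOK review criterion described above.
# Webb's DOK levels: 1 = recall, 2 = skill/concept,
# 3 = strategic thinking, 4 = extended thinking.

TARGET_DOK = 3  # the stated goal for Keystone items

def item_meets_dok_criterion(item_dok: int, eligible_content_dok: int) -> bool:
    """True when the item is at least as cognitively demanding as its
    Eligible Content requires; Level 2 therefore suffices only when the
    Eligible Content's cognitive intent is itself Level 2."""
    return item_dok >= eligible_content_dok

print(item_meets_dok_criterion(item_dok=3, eligible_content_dok=3))  # True
print(item_meets_dok_criterion(item_dok=2, eligible_content_dok=3))  # False
```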

    ONLINE TESTING DESIGN CONSIDERATIONS

The Keystone Exams were designed from the beginning to provide a dual mode of test delivery, using traditional paper-and-pencil forms and computer-based online forms. The computer-based online testing environment (called INSIGHT) is designed to provide a testing experience that mirrors the elements of traditional paper-and-pencil-based test delivery. This includes not only the standard ancillary testing materials available in or with the printed forms (formula sheets, periodic tables, scoring guidelines, and response spaces) but also analogs of the mechanical elements of response generation not necessarily associated with a computer screen interface. These elements include line guides, rulers, screen highlighters, magnifiers, equation building software, online calculators and graphing tools, and keyboard shortcuts.

Other components of online testing (item layout, passage layout, font, screen resolution, navigation tools, and other interface mechanisms) also played a role in the overall design constraints, with some considerations having a more meaningful impact on specific exams. For more information on how the online test design impacted the overall test design considerations, see the sections below under each exam. See also Chapter Twenty for more information on a study comparing the use of both modes of delivery.

Online testing also provides an opportunity to utilize software to generate scores for student responses. In cases where responses to questions involve numerical strings or equations, online responses can be scored through the use of lookup tables. Lookup tables are automated scoring rubrics that contain common correct and incorrect responses. When a response does not match a record in the lookup table, a human scorer is used to adjudicate the score. Operational autoscoring was used only for the Algebra I Exam; see below for more information on its use in Algebra I. For more information on scoring, see Chapter Eight.
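
The general approach of lookup-table scoring with human adjudication can be sketched as follows. This is a hypothetical illustration of the concept just described, not the scoring engine actually used for the Keystone Exams; the table contents and normalization rules are invented.

```python
# Hypothetical sketch of lookup-table autoscoring with a human-scorer
# fallback, as described above. Table contents are illustrative only.

from typing import Optional

# A lookup table maps normalized student responses (e.g., numerical strings
# or equations) to score points, covering common correct and incorrect answers.
LOOKUP_TABLE = {
    "x=4": 1,    # common correct response
    "4": 1,      # equivalent correct response
    "x=-4": 0,   # common incorrect response (sign error)
}

def normalize(response: str) -> str:
    """Collapse whitespace and case so equivalent entries match one record."""
    return "".join(response.split()).lower()

def autoscore(response: str) -> Optional[int]:
    """Return a score if the response matches a lookup-table record;
    return None to route the response to a human scorer for adjudication."""
    return LOOKUP_TABLE.get(normalize(response))

for r in ("x = 4", "x = -4", "2x = 8"):
    score = autoscore(r)
    print(r, "->", score if score is not None else "human adjudication")
```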

    ALGEBRA I

    The Keystone Algebra I Exam has two reporting categories: Module 1, Operations and Linear Equations & Inequalities, and Module 2, Linear Functions and Data Organizations. Both modules include three Assessment Anchors. Module 1 has 18 Eligible Content, and Module 2 has 15 Eligible Content. Each module corresponds to specific content aligned to statements and specifications included in the course-specific Assessment Anchor documents. The Algebra I content included in the Keystone Algebra I multiple-choice items aligns with the Assessment Anchors and Eligible Content statements. The process skills, directives, and action statements also specifically align with the Assessment Anchors and Eligible Content statements. The content included in Algebra I constructed-response items aligns with content included in the Eligible Content statements. The process skills, directives, and action statements included in the performance demands of the Algebra I constructed-response items align with specifications included in the Assessment Anchor statements, the Anchor Descriptor statements, and/or the Eligible Content statements. In other words, the verbs or action statements used in the constructed-response items or stems can come from the Eligible Content, Anchor Descriptor, or Assessment Anchor statements.

    ALGEBRA I ONLINE CONSIDERATIONS

Students taking the computer-based online delivery of the Algebra I Exam are provided with online versions of several common tools typically available to a student taking a traditional paper-and-pencil exam. Each student has access to the following online tools: a standard four-function calculator, a scientific calculator, a graphing tool (similar, but not identical, to a graphing calculator), a ruler (available in metric and English units), a highlighter, a line guide, a magnifier, a sticky note generator, and a cross-off tool. In addition, an equation builder, which allows students to generate complex equations not normally possible with a standard keyboard, is made available with all constructed-response items. Also, if a constructed-response item asks the student to draw, label, or otherwise change a graph, special graph-drawing tools are provided for on-screen graph generation. The Algebra I general scoring guideline and formula sheets are also available to students.

    Layout of both the multiple-choice and constructed-response items is optimized for minimal screen manipulation (minimal scrolling required to see graphics or text that extend beyond the visible working space on the computer screen), and exam items are scrutinized carefully in both print and online versions for continuity and accuracy.

    ALGEBRA I MULTIPLE-CHOICE ITEMS

    Sixty percent of the possible points on the Algebra I Exam are derived from multiple-choice items. This item type is especially efficient for measuring a broad range of content. Each multiple-choice item has four response options, only one of which is correct. The student is awarded one point for choosing the correct response. Distractors typically represent incorrect concepts, incorrect logic, incorrect application of an algorithm, or computational errors.
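
A minimal sketch of this dichotomous multiple-choice scoring rule follows; the answer key and responses are invented solely for illustration.

```python
# Hypothetical illustration of dichotomous multiple-choice scoring:
# one point for the keyed option, zero otherwise.

answer_key = {"item_01": "B", "item_02": "D", "item_03": "A"}        # invented key
student_responses = {"item_01": "B", "item_02": "A", "item_03": "A"}  # invented responses

mc_raw_score = sum(
    1 for item_id, keyed_option in answer_key.items()
    if student_responses.get(item_id) == keyed_option
)
print(mc_raw_score)  # 2
```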

Algebra I multiple-choice items are intended to take about one and a half minutes of response time per item. They are used to assess a variety of skill levels, including problem solving. Algebra I items involving application emphasize the requirement to carry out some mathematical process to find an answer rather than simply recalling information from memory.

    ALGEBRA I CONSTRUCTED-RESPONSE ITEMS

    Constructed-response items (tasks) require that students read a problem description and develop an appropriate solution. Algebra I constructed-response items are designed to take about ten minutes of response time per item. Most of the constructed-response items have several components for the overall task that may enable students to enter or begin the problem at different places. In some items, each successive component is designed to assess progressively more difficult skills or higher knowledge levels. Certain components may ask students to explain their reasoning for applying particular operations or for arriving at certain conclusions. The types of tasks utilized do not necessarily require computations. Students may also be asked to perform such tasks as constructing a graph, shading some portion of a figure, or listing object combinations that meet specified criteria.

Constructed-response tasks are especially useful for measuring students' problem-solving skills in algebra. They offer the opportunity to present real-life situations that require students to solve problems using the mathematical abilities learned in the classroom. Students must read the task carefully, identify the necessary information, devise a method of solution, perform the calculations, enter the solution directly in the answer document, and, when required, offer an explanation. This provides insight into the student's mathematical knowledge, abilities, and reasoning processes.

The constructed-response Algebra I items are scored on a 0-4 point scale using an item-specific scoring guideline. The item-specific scoring guideline outlines the requirements for each score point. Item-specific scoring guidelines are based on the Algebra I General Description of Scoring Guidelines. The general guidelines describe a hierarchy of responses, which represent the five score levels. See Appendix B or the Algebra I Keystone Feb 2011 Item and Scoring Sampler available on PDE's Standards Aligned System (SAS) website: www.pdesas.org/Assessment/Keystone.

The Algebra I Keystone Exam includes two types of constructed-response items: Scaffolded Constructed-Response (SCR) items and Extended Constructed-Response (ECR) items. Both types are scored on the same 0-4 point scale using the same Algebra I General Description of Scoring Guidelines as the base. SCR items are generally constructed to elicit four distinct responses (a response may contain more than one answer blank), and each response has the potential to earn a discrete number of score points (generally just one score point per response). In turn, the four distinct responses are generally organized into four sections, with each labeled as a Part within an SCR. The next table shows a generic (non-authentic) illustration of the application of this concept.

Table 2-3. Generic Example [Non-Authentic] Showing Concept of Four Distinct Responses

Stem (4 points): Presents a numerical distribution.
  Part A: In the answer spaces, write the list of numbers from least to greatest. (1 distinct point, even though students enter more than one number)
  Part B: Write the mean in an answer blank. (1 distinct point with one distinct entry)
  Part C: Write the median in an answer blank. (1 distinct point with one distinct entry)
  Part D: Write the mode in an answer blank. (1 distinct point with one distinct entry)


SCR items do not require narrative, explanation, or "show all your work" responses.

    Most SCR item responses lend themselves to automatic scoring; however, not all items can be automatically scored exclusively with the use of lookup tables. The full application of Assessment Anchors and Eligible Content sometimes requires item construction that is incompatible with lookup tables.
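
To illustrate how an SCR item accumulates its four score points from distinct parts, and how part-level lookup records can automate most (but not all) of that scoring, the following is a hypothetical sketch. The parts mirror the generic example in Table 2-3; the keys, data values, and routing logic are invented and do not represent an actual Keystone item or the operational scoring system.

```python
# Hypothetical sketch of part-level SCR scoring based on the generic example
# in Table 2-3: four parts, one point each, scored against lookup records,
# with unmatched responses routed to a human scorer.

PART_KEYS = {
    "A": {"2,3,5,8,12": 1},   # ordered list written across several blanks
    "B": {"6": 1},            # mean of the invented data set
    "C": {"5": 1},            # median of the invented data set
    "D": {"none": 1},         # mode (assumed answer for this invented data set)
}

def score_scr(responses: dict[str, str]) -> tuple[int, list[str]]:
    """Return (points awarded automatically, parts needing human adjudication)."""
    points, needs_human = 0, []
    for part, key in PART_KEYS.items():
        response = responses.get(part, "").strip().lower()
        if response in key:
            points += key[response]
        else:
            needs_human.append(part)   # no lookup record matched this response
    return points, needs_human

print(score_scr({"A": "2,3,5,8,12", "B": "6", "C": "five", "D": "none"}))
# -> (3, ['C'])  Part C is routed to a human scorer.
```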

In familiar and probably the most descriptive terms, Algebra I ECR items (in form, format, and scoring provisions) adhere to the philosophy of the PSSA open-ended (OE) item format. As with SCR items, development is based on the item qualities that best measure the skills and concepts with which the item aligns.

ECR items intentionally elicit narrative, explanation-of-reasoning, "explain why," and/or "show your work" responses.

In contrast to SCR items, in which DOK Level 3 cognitive engagement is inferred from student responses, ECR items (through explanations and recorded work) often provide direct evidence of DOK Level 3 engagement. This aspect of ECR items is intentionally included during development. Following initial development, each ECR item is either approved by PDE as accepted by the review committee, or PDE and DRC collaborate in amending the item.

    BIOLOGY

The Keystone Biology Exam has two reporting categories: Module 1 [A], Cells and Cell Processes, and Module 2 [B], Continuity and Unity of Life. Both modules have four Assessment Anchors. Module A has 16 Eligible Content, and Module B has 22 Eligible Content. Each module corresponds to specific content aligned to statements and specifications included in the course-specific Assessment Anchor documents. The Biology content included in the Keystone Biology multiple-choice items aligns with the Assessment Anchors and Eligible Content statements. The process skills, directives, and action statements also specifically align with the Assessment Anchors and Eligible Content statements. The content included in Biology constructed-response items aligns with content included in the Eligible Content statements. The process skills, directives, and action statements included in the performance demands of the Biology constructed-response items align with specifications included in the Assessment Anchor statements, the Anchor Descriptor statements, and/or the Eligible Content statements. In other words, the verbs or action statements used in the constructed-response items or stems can come from the Eligible Content, Anchor Descriptor, or Assessment Anchor statements.

    BIOLOGY ONLINE CONSIDERATIONS

    Students taking the computer-based online delivery of the Biology Exam are provided with online versions of several common tools typically available to a student taking a traditional paper and pencil exam. Each student has access to the following online tools: a highlighter, a line guide, a magnifier, a sticky note generator, and a cross-off tool. The Biology general scoring guideline and a periodic table are also provided to students.

    Layout of both the multiple-choice and constructed-response items is optimized for minimal screen manipulation (minimal scrolling to see graphics or text that extend beyond the visible working space on the computer screen), and exam items are scrutinized carefully in both print and online versions for continuity and accuracy.


    BIOLOGY MULTIPLE-CHOICE ITEMS

    Seventy-three percent of the possible points on the Biology Exam are derived from multiple-choice items. Multiple-choice items are especially efficient for measuring a broad range of content. Each multiple-choice item has four response options, on