ABOUT BROOKINGS
The Brookings Institution is a private nonprofit organization devoted to independent research and innovative policy solutions. For more than 90 years, Brookings has analyzed current and emerging issues and produced new ideas that matter for the nation and the world.

ABOUT THE BROWN CENTER ON EDUCATION POLICY
Raising the quality of education in the United States for more people is imperative for society's well-being. With that goal in mind, the purpose of the Brown Center on Education Policy at Brookings is to examine the problems of the American education system and to help delineate practical solutions. For more information, see our website, www.brookings.edu/brown.

This report was made possible by the generous financial support of The Brown Foundation, Inc., Houston.


The 2012 Brown Center Report on American Education:
HOW WELL ARE AMERICAN STUDENTS LEARNING?

With sections on predicting the effect of the Common Core State Standards, achievement gaps on the two NAEP tests, and misinterpreting international test scores.

February 2012
Volume III, Number 1

by: TOM LOVELESS
Senior Fellow, The Brown Center on Education Policy


TABLE OF CONTENTS

Introduction

PART I
Predicting the Effect of Common Core State Standards on Student Achievement

PART II
Measuring Achievement Gaps on NAEP

PART III
Misinterpreting International Test Scores

Notes

Research assistance by: MICHELLE CROFT

Copyright 2012 by
THE BROOKINGS INSTITUTION
1775 Massachusetts Avenue, NW
Washington, D.C. 20036
www.brookings.edu

    All rights reserved


THE 2012 BROWN CENTER REPORT ON AMERICAN EDUCATION

This edition of the Brown Center Report on American Education marks the first issue of volume three and the eleventh issue overall. The first installment was published in 2000, just as the presidential campaigns of George W. Bush and Al Gore were winding down. Education was an important issue in that campaign. It has not been thus far in the current campaign for the Republican nomination (as of February 2012), and it is unlikely to be a prominent issue in the fall general election. Despite that, the three studies in this Brown Center Report investigate questions that the victor in the 2012 campaign, and the team assembled to lead the U.S. Department of Education, will face in the years ahead.

The first section is on the Common Core State Standards, a project that President Obama has backed enthusiastically. Forty-six states and the District of Columbia have signed on to the Common Core; detailed standards have been written in English language arts and mathematics; and assessments are being developed to be ready by the 2014–2015 school year. The first section attempts to predict the effect of the Common Core on student achievement.

Despite all the money and effort devoted to developing the Common Core State Standards (not to mention the simmering controversy over their adoption in several states), the study foresees little to no impact on student learning. That conclusion is based on analyzing states' past experience with standards and examining several years of scores on the National Assessment of Educational Progress (NAEP).


Achievement gaps are the test score differences between groups of students with different socioeconomic (SES) characteristics: for example, racial or ethnic background, family income, or language status. The second section poses the question: do the two NAEP tests report similar achievement gaps? Researchers and policy makers are well aware that significant test score gaps exist between SES groups. Researchers try to study them; policy makers try to close them. What NAEP has to say about the magnitude of such gaps plays an important role in the policy arena. The analysis presented in section two indicates that the two NAEPs do in fact differ. The Main NAEP consistently reports larger SES gaps. This is only a preliminary study, a first cut at the data that reveals a general pattern, so the findings must be viewed cautiously. And explanations for the phenomenon are necessarily speculative. More work needs to be done on this topic.

The third section of the report is on international assessments. Interpretations of international test scores are characterized by three common mistakes. The first occurs when a nation's scores go up or down dramatically and analysts explain the test score change by pointing to a particular policy. The case of Poland's gains in reading is offered as an excellent example of dubious causality attributed to a single policy. The second mistake stems from relying on rankings to gauge a country's academic standing. National rankings have statistical properties that can mislead observers into thinking that large differences are small or small differences are large. They can also make growth appear larger or smaller than it really is. Several examples are provided of misinterpretations of rankings, along with suggestions on how to avoid them. The third mistake is pointing to a small group of high-performing nations, often called "A+ countries," and recommending, with no additional analysis, that their policies should be adopted. The same policies may be embraced by the lowest performing nations or nations in the middle of the distribution. On any test, the entire distribution must be considered, not just scores at the top.


PART I
PREDICTING THE EFFECT OF COMMON CORE STATE STANDARDS ON STUDENT ACHIEVEMENT


FORTY-SIX STATES AND THE DISTRICT OF COLUMBIA HAVE signed on to the Common Core State Standards Initiative, a project sponsored by the Council of Chief State School Officers (CCSSO) and the National Governors Association (NGA). The Common Core spells out what students should learn in mathematics and English language arts from kindergarten to the end of high school. The standards were written by teams of curriculum specialists and vetted by panels of academics, teachers, and other experts.1 In 2010, the federal government funded two consortia to develop assessments aligned with the Common Core. The new tests are to be ready in 2014.

The push for common education standards argues that all American students should study a common curriculum, take comparable tests to measure their learning, and have the results interpreted on a common scale, with the scale divided into performance levels to indicate whether students are excelling, learning an adequate amount, or falling short. Past experience with standards suggests that each part of this apparatus (a common curriculum, comparable tests, and standardized performance levels) is necessary. No one or two of them can stand alone for the project to succeed.

Proponents point to the intuitive appeal of a common curriculum. "It's ludicrous," Bill Gates told the Wall Street Journal, "to think that multiplication in Alabama and multiplication in New York are really different."2 In a report called The Proficiency Illusion, the Fordham Institute made a similar point regarding state efforts to evaluate schools using fifty different assessments and fifty different definitions of what constitutes acceptable performance.3 How can a school in one state be labeled a failure while a school in another state and with almost exactly the same test scores can be considered a success?

The authority to operate school systems is constitutionally vested in states. But states have undermined their own credibility when it comes to measuring student learning. Accounts of dumbed-down and poorly written state tests, manipulation of cut scores to artificially boost the number of students in higher performance levels, and assessments on which students can get fewer than 50% of items correct and yet score proficient fuel the belief that states individually cannot be trusted to give the public an accurate estimate of how American education is doing.4

Three Theorized Effects
The Common Core State Standards are theorized to improve education in three ways. First, proponents argue that the Common Core is superior to most current state standards. In a recent study, the Fordham Institute concluded that Common Core standards are better than 37 states' standards in English-language arts and 39 states' in mathematics.5 It follows, proponents believe, that the Common Core will raise the quality of education nationally by defining a higher-quality curriculum in English-language arts and mathematics than is currently taught. Let's call this the quality theory. Achievement will increase because students will study a better curriculum.

The second idea is that the Common Core sets higher expectations than current state standards, the assumption being that cut points on the new assessments will be set at a higher level than states currently set on their own tests. Comparisons with the National Assessment of Educational Progress (NAEP) lead many analysts to conclude that states set proficiency standards far too low. States routinely report more students attaining proficiency than NAEP indicates, often 30–40 percentage points more.6 The No Child Left Behind Act left it up to the states to design their own tests and to set performance levels wherever they want, but the pattern of states reporting significantly higher percentages of proficient students preceded NCLB.7 A new Common Core test will presumably end such discrepancies by evaluating proficiency using the same standards for every state, and these standards are to be more rigorous than those currently used. Schools and students will respond by reaching for these loftier goals. Let's call this the rigorous performance standards theory.

The third hypothesis is that standardization yields its own efficiencies. In the same Wall Street Journal interview cited above, Bill Gates referred to this idea by complaining about the time and money wasted on the many different versions of textbooks that are published to conform to individual states' curricular tastes.8 In a reverse spin on the same argument, others argue that textbooks are bloated with redundant content as publishers attempt to incorporate numerous idiosyncratic state curricular mandates into one book.9 The assumption of both arguments is that one high-quality textbook (or perhaps a few aligned with the same content standards) used by all American students attending the same grade would be an improvement over the status quo. Other proponents point to the potential gaps in learning that occur as students move from state to state. Especially when students move mid-year, important concepts might be missed while other concepts are studied unnecessarily a second time. Teachers who move from state to state experience similar difficulties in terms of lesson planning. Let's call this the standardization theory.

Opposing Arguments
Some analysts question the theories behind the Common Core. Writing in Education Week in the summer of 2011, Andrew Porter compared the Common Core to existing state standards and to international standards from other countries and concluded that


the Common Core does not represent much improvement.10 Opponents of the Common Core, including Sandra Stotsky, James Milgram, Zeev Wurman, and Williamson Evers, criticize the quality of the proposed standards for English-language arts and mathematics. They conclude that the math standards, in particular, are inferior to existing standards in Massachusetts and California.11

Critics of the Common Core issued a counter-manifesto arguing that the proposed common standards would undermine the decentralized, federalist principles on which education has been governed since America's founding. Declaring that a "one-size-fits-all, centrally controlled curriculum" does not make sense, the counter-manifesto states that only weak evidence supports the push for national standards. International test data are not helpful since most countries have national standards, and the few that do not, including Canada and Germany, have both impressive and non-impressive scores. Concern for interstate student mobility is overblown, the counter-manifesto claims, because very few students move between states. Most mobility is within state, which is already addressed by the No Child Left Behind Act's requirement that every state establish standards. Since 2003, every state has had curriculum standards that delineate the curriculum for public schools within its borders.12

Can empirical evidence shed light on the main points of contention in this debate? Not entirely. Much of the argument is philosophical. Those who believe that the Common Core enumerates what schools should be teaching and students should be learning support the proposed standards. And those who believe a greater degree of standardization would produce more common educational outcomes (and that common outcomes are desirable) also support the proposed standards. Those holding to the opposite beliefs, and believing that local school governance is preferable to governance by larger entities, are critics of the standards.

Despite the philosophical disagreements, there are empirical questions on which evidence exists. The nation has had several years of experience with education standards (since the 1980s in many states and since 2003 in all states), and data exist that can help predict the magnitude of effects from the Common Core. How much does raising the quality of standards matter in boosting student achievement? Will raising the bar for attaining proficiency (in other words, increasing the rigor of performance standards) also raise achievement? And how much variance will be reduced (or how much sameness in achievement will be attained) by having students across the country studying a common curriculum?

Quality and Achievement
Let's start with the theory that high-quality standards promote achievement gains. In October 2009, a colleague at Brookings, Grover "Russ" Whitehurst, investigated whether quality ratings for state standards, as judged by the two most cited ratings (from the American Federation of Teachers and Fordham Foundation), are correlated with state NAEP scores. Whitehurst found that they are not. States with weak content standards score about the same on NAEP as those with strong standards. The finding of no relationship held up whether NAEP scores from 2000, 2003, 2005, 2007, or the gains from 2000–2007 were used in the analysis. And it held up for the scores of both white and black students.13

The current study extends that inquiry by looking at NAEP data from 2003–2009. Gain scores on NAEP reading and math tests from 2003 and 2009 are combined to form a composite gain score. The scores



2005 NAEP scores are unrelated to where the states drew the line for proficiency in 2005. Fourth-grade reading and math have slightly negative correlations (-0.22 and -0.12, respectively), as does eighth-grade reading (-0.11). The correlation coefficient for eighth-grade math is 0.00. State achievement is unrelated to the level at which states define proficiency. The same is true for 2009 NAEP scores and the level at which proficiency was placed that year (see the second column of the table).

The final column of Table 1-2 investigates whether changes in state NAEP scores from 2005–2009 are related to changes in proficiency level. Did states that raised the bar also perform better? And did states that lowered the bar perform worse? Correlation coefficients for 8th grade are near zero. Positive and statistically significant correlations were found for fourth-grade reading (0.35) and fourth-grade math (0.34). It is interesting that the absolute level of performance standards does not seem to matter, but raising or lowering levels does exhibit a relationship with fourth-grade changes in achievement, explaining about 12% of the variation in the change in state NAEP scores.

Whether one phenomenon is causing the other is difficult to tell. Changes in proficiency cut points are probably endogenous to trends in test scores. In other words, states with rising scores may feel emboldened to raise their proficiency cut points, and those with declining scores may feel compelled to lower theirs. That is quite a different story than the raising or lowering of cut points producing changes in test scores. Unfortunately, simple correlations cannot determine the direction of causality, or whether causality exists at all, only whether these two variables are statistically related. In the current analysis, change in level is related to change in fourth-grade scores.

How Common Will Achievement Become?
The third theory concerns standardization. For the Common Core movement, attaining greater standardization of educational outcomes is an important goal. If standards do not reduce variation, then even if they boost performance, simply raising average scores will still leave many states (and the districts, schools, and students within states) far behind and far below acceptable levels of performance. The two previous analyses indicate that it is unlikely that common standards will boost performance; however, it is possible for the national average on NAEP to remain stable while variation is reduced: for instance, if top states decline a little while states at the bottom rise by the same amount. Another way would be for high-flying schools within states to decline a little while poorly performing schools increase their performance by a commensurate amount.

In terms of state NAEP scores, variation comes in two forms: variation between states and variation within states. We would expect common standards to reduce variation between states, so that the NAEP score difference between states at the top and bottom of the rankings would be reduced. States that currently offer vastly different curricula, assessments, and performance standards will harmonize those elements of their educational systems. One would expect test score differences to shrink. That is the essence of common standards. Within-state variation, on the other hand, remains unaffected by common standards. Every state already has standards placing all districts and schools within its borders under a common regime. And despite that, every state has tremendous within-state variation in achievement. Schools that score at the top of the world on



international assessments are within a short car trip, sometimes even within a short subway ride, of schools that score at the level of the world's lowest achieving nations.

Let's compare these two forms of variation. Table 1-3 displays data on NAEP standard deviations between and within states. Standard deviation is a measure of variation, the amount of spread in a group of data. On any particular test, about two-thirds of observations are within one standard deviation (above and below) of the average score. Between-State SD is the standard deviation of NAEP scores for the fifty states and the District of Columbia: how much they differ from each other. Within-State SD is the average of the standard deviations for the fifty states and the District of Columbia: how much the students within each state, on average, differ from each other.

The findings are clear. Most variation on NAEP occurs within states, not between them. The variation within states is four to five times larger than the variation between states. Much of the similarity of state scores comes from aggregating individual student scores, which differ greatly, to the state level. The variation in student performance within states washes out to produce means that are alike across states. Consider this: fourth-grade NAEP scores in math range from Massachusetts at the top with 252 down to the District of Columbia with 219. That 33-point difference is not too much larger than the average standard deviation within states (27.8). What does that mean? Consider Massachusetts and Mississippi, a state with low scores but not at the very bottom. Their NAEP means differ by 25 points. Every state, including Massachusetts and Mississippi, has a mini Massachusetts-Mississippi contrast within its own borders. That variation will go untouched by common state standards.

Discussion
What effect will the Common Core have on national achievement? The analysis presented here suggests very little impact. The quality of the Common Core standards is currently being hotly debated, but the quality of past curriculum standards has been unrelated to achievement. The rigor of performance standards (how high the bar is set for proficiency) has also been unrelated to achievement. Only a change in performance levels has been related to an increase in achievement, and that could just as easily be due to test score changes driving changes in policy, not the other way around. The Common Core may reduce variation in achievement between states, but as a source of achievement disparities, that is not where the action is. Within-state variation is four to five times greater.

The sources of variation in educational outcomes are not only of statistical importance but also bear on the question of how much state policy can be expected to change schools. Whatever reduction in variation between, say, Naperville and Chicago that can be ameliorated by common standards has already been accomplished by Illinois's state efforts. State standards have already had a crack at it. Other states provide even more deeply rooted historical examples.


Table 1-3. Between- and Within-State Variation of State NAEP Scores (in Standard Deviations)

                     Average State   Between-   Within-    Multiple
                     NAEP Score      State SD   State SD   (Within/Between)
4th Grade Reading    220.1           6.6        34.7       5.3
4th Grade Math       239.5           6.3        27.8       4.4
8th Grade Reading    263.3           6.5        32.9       5.1
8th Grade Math       282.4           8.5        34.8       4.1


California has had state curriculum frameworks since at least 1962, statewide testing with scores for every school published publicly since 1971 (except for a brief timeout in the early 1990s), state textbook adoption for K–8 since the nineteenth century, and a court-ordered equalized spending system since the late 1970s. Any effect that these laws have on reducing achievement variation within the state has already occurred. The Common Core must go beyond these efforts to reduce variation in California's achievement. That is highly unlikely.

Two lessons can be drawn from the analysis above. First, do not expect much from the Common Core. Education leaders often talk about standards as if they are a system of weights and measures; the word "benchmarks" is used promiscuously as a synonym for standards. But the term is misleading because it implies that there is a real, known standard of measurement. Standards in education are best understood as aspirational, and like a strict diet or a prudent plan to save money for the future, they represent good intentions that are not often realized.

Why don't aspirational standards make much of a difference? Researchers from the International Association for the Evaluation of Educational Achievement (IEA) first sketched the concept of opportunity to learn using international test score data in the 1970s.16 Distinctions were drawn among the intended, implemented, and achieved curriculums. The intended curriculum is embodied by standards; it is what governments want students to learn. The differences articulated by state governments in this regard are frequently trivial. Bill Gates is right that multiplication is the same in Alabama and New York, but he would have a difficult time showing how those two states (or any other two states) treat multiplication of whole numbers in significantly different ways in their standards documents.

What is crucial is the distance between the intended curriculum and the two curriculums below it. The implemented curriculum is what teachers teach. Whether that differs from state to state is largely unknown; what is more telling is that it may differ dramatically from classroom to classroom in the same school.17 Two fourth-grade teachers in classrooms next door to each other may teach multiplication in vastly different ways and with different degrees of effectiveness. State policies rarely touch such differences. The attained curriculum is what students learn. Two students in the same classroom and instructed by the same teacher may acquire completely different skills and knowledge. One student understands and moves on; another struggles and is stuck. And that even happens in classrooms with outstanding teachers.

The whole system is teeming with variation. Policies at national, state, district, and school levels sit on top of these internal differences, but they rarely succeed in ameliorating them. The Common Core will sit on top of the implemented and attained curriculums, and notwithstanding future efforts to beef up the standards' power to penetrate to the core of schooling, they will probably fail to dramatically affect what goes on in the thousands of districts and tens of thousands of schools that they seek to influence.

A final word on what to expect in the next few years as the development of assessments tied to the Common Core unfolds. The debate is sure to grow in intensity. It is about big ideas: curriculum and federalism. Heated controversies about the best approaches to teaching reading and math have sprung up repeatedly over the past century.18 The proper role of the federal government, states, local districts, and schools in deciding key educational questions, especially in deciding what should be taught,



remains a longstanding point of dispute. In addition, as NCLB illustrates, standards with real consequences are most popular when they are first proposed. Their popularity steadily declines from there, reaching a nadir when tests are given and consequences kick in. Just as the glow of consensus surrounding NCLB faded after a few years, cracks are now appearing in the wall of support for the Common Core.

Don't let the ferocity of the oncoming debate fool you. The empirical evidence suggests that the Common Core will have little effect on American students' achievement. The nation will have to look elsewhere for ways to improve its schools.



PART II
MEASURING ACHIEVEMENT GAPS ON NAEP


THIS SECTION OF THE BROWN CENTER REPORT IS ABOUT TEST score gaps among groups of students identified by race, ethnicity, family income, or native language; in other words, characteristics related to socioeconomic status (SES). Much has been written about such gaps, and the National Assessment of Educational Progress (NAEP) frequently serves as the source of data to measure them.

There are two NAEP tests: the Main and the Long-Term Trend (LTT). Do the two NAEPs provide similar estimates of SES achievement gaps? As the analysis below shows, it appears that they do not. The discrepancy has implications for national education policy, especially since closing achievement gaps related to SES is a prominent goal of federal policy, included in the No Child Left Behind Act (NCLB), Race to the Top, and the Blue Ribbon Schools selection criteria. NAEP serves as the most widely recognized instrument for determining whether the nation is making progress toward meeting the goal.

Background
Gaps between socioeconomic groups as measured by prominent tests have long interested observers of U.S. education. The Coleman Report of 1966 and subsequent analyses of the same data set, in particular the Harvard-based Inequality studies, revealed large gaps between advantaged and disadvantaged youngsters on tests of academic achievement.19 Gaps between blacks and whites on the SAT, American College Test (ACT), and Armed Forces Qualifying Test (AFQT) have been analyzed to estimate differences in preparation for college and military service and to document unequal opportunities for learning in American society. Racial differences on IQ tests (e.g., Stanford-Binet Intelligence Scales, Wechsler Intelligence Scale for Children) have been persistent for several decades, and the explanations for them exceedingly controversial.20

In the 1970s, tests came under fire for containing racial bias. A district court judge in a California court case, Larry P. v. Riles,21 found intelligence tests culturally biased against blacks and banned using them in that state for placing students into special education programs. Analysts often draw a distinction between achievement tests, designed to measure the material that students have learned in the past, and aptitude tests (of which IQ tests are a species), designed to measure students' capacity


for learning in the future. Aptitude tests are believed to be more prone to cultural bias because they depart from the school curriculum, meaning that much of the skill they measure may be picked up in families or among peers, social units that embody cultural differences. An achievement test, on the other hand (an Advanced Placement Chemistry Exam, for example), is explicitly designed to measure whether students have learned curricular material that is taught in school.22

These distinctions are fuzzy. What students have learned in the past undoubtedly influences their capacity for future learning. So there must be some overlap. A telling sign of the confusion is apparent in the Scholastic Aptitude Test (or SAT), which was renamed the Scholastic Assessment Test in the 1990s, and then renamed again in 2003 to just the SAT, an initialism signifying no words.23 The College Board was scrambling to get on the right side of public views toward the word "aptitude." Aptitude has a connotation of being innate and immutable. Achievement has a connotation of referring to the knowledge and skills one acquires by being a good student and working hard at school. As a statistical matter, the SAT and its more achievement-oriented competitor, the ACT, are highly correlated (with a correlation coefficient of about 0.92).24

Test items that are sensitive to cultural differences are described as culturally loaded or having a pronounced cultural load factor. Test developers have strict protocols for screening items for cultural bias. Despite its name change, the SAT has received scrutiny. A study by Roy Freedle, a cognitive psychologist who worked for over thirty years at the Educational Testing Service, caused a furor in 2003 by reporting statistically significant black-white differences on many SAT items.25

Freedle examined data based on differential item functioning (DIF) of SAT items, a technique that compares the responses of two subgroups of examinees matched on proficiency. The technique controls for overall ability to answer a test's items correctly, thereby isolating whether other characteristics, typically race or gender, are associated with differential performance on particular items. Freedle found that easy items (those that most examinees got correct) on the verbal section of the SAT favored whites. Freedle hypothesized that easy items contain words and concepts that are often vague. Examinees flesh out the items' meanings by calling on their own life experiences, creating different interpretations across cultural groups. Hard items contain words and concepts that are specialized, more technical, more precise, learned through education, and with meanings that do not vary across cultural groups.

The study was replicated in 2010 by Maria Veronica Santelices and Mark Wilson, employing a later SAT data set, more robust statistical controls, and a different methodology that addressed critiques of Freedle's study. They confirmed the relationship between item difficulty and differences by race on verbal items, but found no such differences with Hispanic students or among any subgroups on math items.26 Again, black-white differences favoring whites emerged on the SAT's easier items.

In addition to the psychometric issues raised by these studies, policy concerns about the consequences of emphasizing either aptitude or achievement have been raised by other observers of American education. Writing in Education Week in 1993, Michael Kirst and Henry Rowen urged colleges and universities to base admissions on achievement tests linked to the high school curriculum, not on aptitude-oriented tests such as the SAT. Such a move would enhance


minority access to higher education, Kirst and Rowen argued, citing research by John Bishop of Cornell University showing that minority-majority differentials tend to be smaller (in standard deviation units) on achievement tests (including the LTT NAEP) than on aptitude tests (including the SAT).27

What We Did
The LTT and Main NAEPs are given in different years and to slightly different populations. The LTT is age-based, whereas the Main NAEP is grade-based. We first paired LTT NAEP data with Main NAEP data. The 2004 LTT was paired with the 2005 Main NAEP and the 2008 LTT with the 2009 Main. Age 9 on the LTT was paired with 4th grade on the Main, age 13 with 8th grade, and age 17 with 12th grade. These pairings are conventional in NAEP analysis, but they may introduce bias or influence the findings reported below in unknown ways.

We then calculated the achievement gaps on both tests for the groups defined by four SES variables: students qualifying for free and reduced lunch vs. those who do not, black vs. white, Hispanic vs. white, and students who are English language learners vs. those who are not. The gaps were standardized by dividing the point gap by the test's standard deviation. The standardized gaps are reported in the tables below, along with the differences between the LTT and Main.

Results
Table 2-1 reports NAEP gaps between students who do and do not qualify for free and reduced lunch. Qualifying for the free and reduced lunch program is often used as a proxy for poverty in studies of student achievement. The gaps range from 0.53 (about one-half standard deviation) on the Main's test of reading for 17-year-olds to 0.79 (about three-quarters of a standard deviation) on the Main's test of mathematics for fourth graders. The gaps on both tests tell us, not surprisingly, that students from wealthier families score significantly higher on NAEP than students from poorer families. The gaps are generally larger on the Main than the LTT, the one exception being reading at age 17/grade 12, in which the gap on the LTT is slightly larger. The six cells in the table report an average gap of 0.70 on the Main and 0.66 on the LTT, resulting in a difference of 0.04. Put another way, achievement gaps related to poverty status are about 0.04 standard deviations larger on the Main NAEP than on the LTT NAEP.

Table 2-2 shows data for black-white differences on the two tests. Again, gaps are larger on the Main than the LTT, with the exception at age 17/grade 12 in math. Two

Table 2-1. NAEP Main and LTT Gap Differences: Free and Reduced Lunch

                     Reading                        Math
                     LTT    Main   Main-LTT Diff*   LTT    Main   Main-LTT Diff*
Age 9/Grade 4        0.70   0.75    0.05            0.70   0.79    0.09
Age 13/Grade 8       0.61   0.69    0.07            0.72   0.76    0.04
Age 17/Grade 12      0.56   0.53   -0.03            0.66   0.68    0.01

Average Main gap = 0.70; average LTT gap = 0.66; difference = 0.04 (Main larger).
* May not correspond to column difference due to rounding.

Table 2-2. NAEP Main and LTT Gap Differences: Black-White

                     Reading                        Math
                     LTT    Main   Main-LTT Diff*   LTT    Main   Main-LTT Diff*
Age 9/Grade 4        0.67   0.76    0.08            0.72   0.91    0.19
Age 13/Grade 8       0.60   0.79    0.19            0.84   0.91    0.08
Age 17/Grade 12      0.62   0.69    0.07            0.89   0.88   -0.01

Average Main gap = 0.83; average LTT gap = 0.72; difference = 0.11 (Main larger).
* May not correspond to column difference due to rounding.


of the differences are 0.19 standard deviation units: the gaps for reading at age 13/grade 8 and for math at age 9/grade 4. These are modest effect sizes at best, but as the difference between two tests, they are noteworthy. Take the age 13/grade 8 gaps in reading, for example. The LTT estimates the black-white gap as 0.60 and the Main as 0.79. That is a difference of about 32%. If policymakers were to use the LTT NAEP to establish a baseline and then use the Main NAEP as a later measure, they would be misled to conclude that the black-white gap had grown by 32%, even though this is an artifact of differences between the two NAEP tests and not a real trend. Conversely, using the Main NAEP as a baseline and the LTT as a later measure would give the false impression of the gap shrinking by about 24%. The two NAEP tests are not interchangeable when it comes to measuring black-white test score gaps.

The Hispanic-white gap differences are reported in Table 2-3. The Main NAEP reports larger gaps than the LTT for all six subject/age/grade combinations. In age 9/grade 4 math the Hispanic-white gap is 0.72 standard deviations, about 57% wider than the 0.46 Hispanic-white gap on the LTT NAEP. The average gap difference is 0.10 between the two tests, but that is largely driven by the large age 9/grade 4 math difference. Omitting that figure leaves an average test difference of 0.06 for the remaining pairings.

The gaps for English language learners (ELL) are presented in Table 2-4. Again, the Main NAEP reports larger gaps than the LTT NAEP. These are the largest test differences for the four SES variables analyzed. Both differences at age 9/grade 4 are large: 0.19 in reading and 0.24 in math. They correspond to gaps that are 23% larger in reading and 41% larger in math when recorded by the Main NAEP compared to the LTT NAEP.

The final two tables provide summary data. Table 2-5 reports the average gaps for each SES variable and the gap difference between the Main and LTT NAEPs. Keep in mind that there is undoubtedly some overlap of the SES categories; for example, the ELL gap differences are surely related to the Hispanic-white differences. In addition, black and Hispanic students are both more likely than white students to be poor, so a portion of the black-white and Hispanic-white gaps is shared with the gap pertaining to free and reduced lunch.

In terms of groups of students based on SES characteristics, the largest differences between the LTT and Main NAEP are with ELL students (see Table 2-5). The role that language plays in the two NAEP


Table 2-3. NAEP Main and LTT Gap Differences: Hispanic-White

                     Reading                        Math
                     LTT    Main   Main-LTT Diff*   LTT    Main   Main-LTT Diff*
Age 9/Grade 4        0.61   0.72    0.11            0.46   0.72    0.26
Age 13/Grade 8       0.64   0.70    0.06            0.65   0.74    0.09
Age 17/Grade 12      0.53   0.56    0.03            0.67   0.68    0.01

Average Main gap = 0.69; average LTT gap = 0.59; difference = 0.10 (Main larger).
* May not correspond to column difference due to rounding.

Table 2-4. NAEP Main and LTT Gap Differences: English Language Learners

                     Reading                        Math
                     LTT    Main   Main-LTT Diff*   LTT    Main   Main-LTT Diff*
Age 9/Grade 4        0.81   1.00    0.19            0.59   0.83    0.24
Age 13/Grade 8       1.12   1.26    0.14            0.96   1.09    0.13
Age 17/Grade 12      1.06   1.19    0.13            0.90   1.00    0.10

Average Main gap = 1.06; average LTT gap = 0.90; difference = 0.16 (Main larger).
* May not correspond to column difference due to rounding.


tests could be producing different estimates of the ELL and non-ELL achievement gap. Table 2-6 takes another cut at the data by showing the tests' differences by subject and by grade level. Gaps in math and reading look similar, but the age 9/grade 4 pairings stand out with the largest differences. The divergence of the two NAEPs along the dimensions of language and age is an intriguing finding. That finding, along with the headline result that the Main NAEP consistently reports larger SES gaps than the LTT NAEP, is an important consideration for researchers who use NAEP data to investigate achievement gaps. Both are also important factors for NAEP policymakers to think about when deciding the future of the Nation's Report Card.

Discussion
Let's tackle three questions provoked by the study's findings: (1) Which NAEP is right? (2) Why do the two NAEPs differ? and (3) Does it really matter?

Which NAEP is right?
They may both be right. The two tests are different instruments for measuring student learning, and although they share the NAEP label and would no doubt produce highly correlated results if given to the same sample of students, they measure different constructs. The reading skills assessed on the LTT NAEP are not the same as the reading skills assessed on the Main NAEP, nor are the skills measured by the math tests comparable. In the future, investigations that dig down to make comparisons on an item-by-item basis may discover that the Main NAEP produces inflated estimates of achievement gaps or that the LTT understates those gaps, but this preliminary investigation only makes a start by comparing the relative performance gaps of subgroups on each test.

Why do the two NAEPs differ in measuring SES achievement gaps?
Content differences may play a role. As just mentioned, the Main NAEP was designed to assess different skills and concepts than the LTT NAEP, which had a nearly twenty-year track record when the Main NAEP was first launched in 1990. In math, for example, the LTT NAEP focuses more on computing with whole numbers and fractions; the Main NAEP, on how students apply mathematics to solve problems. In reading, the LTT NAEP presents shorter passages, more vocabulary words in isolation, and more items asking students to identify the main idea of a passage. The Main NAEP has a broader selection of literary forms and asks students to compare multiple texts.28

Table 2-5. Summary of NAEP Main and LTT Gap Differences: Average by SES Variables

                         Average     Average    Increase of Gap on
                         Main Gap    LTT Gap    Main Compared to LTT
Free and Reduced Lunch   0.70        0.66        5.79%
Black-White              0.83        0.72       13.92%
Hispanic-White           0.69        0.59       15.78%
ELL                      1.06        0.90       17.26%

Table 2-6. Summary of NAEP Main and LTT Gap Differences: Average by Subject and Grade

                     Average     Average    Increase of Gap on
                     Main Gap    LTT Gap    Main Compared to LTT
Reading              0.80        0.71       12.91%
Math                 0.83        0.73       14.06%
Age 9/Grade 4        0.81        0.66       23.07%
Age 13/Grade 8       0.87        0.77       13.16%
Age 17/Grade 12      0.78        0.74        5.29%


Neither test is absolute on these dimensions; for instance, the Main NAEP in math includes some computation items, and the LTT includes problem-solving items. The difference is one of emphasis. To better measure how well students apply knowledge, to assess a broader array of topics, and to modernize the national assessment, the Main NAEP was created. But the upshot is this: the contemporary skills and knowledge measured by the Main NAEP, compared to the more traditional skills assessed by the Long-Term Trend, may be more influenced by the background experiences of students. That would widen the measured gaps between groups defined by socioeconomic characteristics. If attending school works to reduce such differences, that would also explain why the largest discrepancies between the two NAEPs appear with the youngest pairing of students (9-year-olds/fourth graders) rather than the older ones.

Another possibility is that disadvantaged students are less likely to be exposed to the knowledge and skills on the Main NAEP as compared to the LTT NAEP. Fewer opportunities to learn the Main NAEP's content in schools serving a preponderance of poor children, for example, would widen gaps between students qualifying for free and reduced lunch and those from wealthier households. The Main NAEP also poses more open-ended, constructed-response items, as opposed to the multiple choice or short answer items that are more prevalent on the LTT. Research shows that different item formats can affect the performance of ELL students. A 1999 study by the American Institutes for Research investigated why students do not answer many NAEP items. The researchers found that constructed-response items were much more likely to be skipped than multiple choice items, and that students with limited English proficiency were especially prone to non-response on constructed-response items.29

Does it really matter?
Achievement gaps command the nation's attention. The No Child Left Behind Act established a national goal of closing achievement gaps between groups based on race, ethnicity, economic status, and disability. An outpouring of scholarship has documented the persistence of gaps, explored their causes, and evaluated interventions to reduce them.30 Analysis of trends in achievement gaps on NAEP is a regular feature of reports from the U.S. Department of Education.31 Measuring the magnitude of achievement gaps is an important function of NAEP, and it is important that the estimates be as precise as possible. As noted above, the fact that the two NAEP tests generate different estimates is not necessarily a problem, but it does deserve investigation and explanation.

The mere existence of two NAEP tests confuses many followers of test scores. The confusion surely increases when historical accounts of gap trends are published. Because the Main NAEP was started in 1990, studies of achievement gaps before then must rely on the LTT.32 But only national trends can be tracked before 1990 because the LTT does not produce state scores. Studies of state gap trends must rely on the Main NAEP.33 In addition, studies examining gaps in the percentage of students attaining specific achievement levels (for example, the percentage of blacks and whites scoring at advanced, proficient, basic, or below basic) are confined to the Main NAEP. The LTT NAEP does not have such levels.34

The upshot is this: studies conducted with one NAEP may have yielded different findings if they had been conducted with the other NAEP. State gaps would be narrower if measured with an LTT NAEP. Pre-1990 gaps



would be wider if measured with the Main NAEP. The widening of the black-white gap from 24 to 27 scale score points on the age 9 LTT math test from 1988–1990, a reversal of several years of steady narrowing, falls well within the LTT-Main difference found by the current study (about 26%) on measuring gaps in math at that age and grade. The point extends to other databases based on NAEP tests. The reading and math assessments of the Early Childhood Longitudinal Study (ECLS), which reveal large achievement gaps among kindergartners in their first days of school, are based on the Main NAEP.35 They even include some publicly released Main NAEP items.36 The current study suggests that an ECLS with a test based on the LTT NAEP would find narrower gaps.

A final word regarding higher-order items. The goal of assessing higher-order skills is laudable but must be implemented cautiously. Higher-order test items are more prone to cultural bias than items assessing basic skills, for two reasons. First, basic skills are universally agreed upon, transcend culture, and even constitute a world curriculum.37 Abstract skills (application, analysis, synthesis, and evaluation) are varied in interpretation and more strongly influenced by the cultural lens through which they are interpreted and expressed.38 Second, higher-order items often ask for constructed responses from examinees, as opposed to simpler response formats (e.g., multiple choice). In mathematics, such items typically go beyond showing one's work and require students to explain their reasoning or to communicate something about the mathematics of the problems. Items involving only mathematical symbols do not rely on language skills. As noted above, ELL students are more likely than native English speakers to skip constructed-response items. The reliance on expressive language is unavoidable on English language arts tests, of course, but may introduce into math tests a cultural bias that should be avoided on our national assessment.



PART III
MISINTERPRETING INTERNATIONAL TEST SCORES


INTERNATIONAL TEST SCORES RECEIVE A LOT OF ATTENTION, especially when first released. The press scrambles to find pundits offering instant analysis. Policy makers pore over the results to glean lessons for governments. Advocates look for evidence to bolster pre-existing arguments. This section of the Brown Center Report is about errors that arise from such interpretations, focusing on the three most common and potentially most misleading mistakes that are made when interpreting international test scores.

Approximately fifty years ago, the founders of international assessments believed that comparing nations on tests of academic achievement would allow the world to serve as a laboratory of innovation and experimentation, and that international tests could illuminate the different approaches that countries take to solve education's problems and the relative effectiveness of those efforts. The promise of international assessments is not advanced when data are misused.39

Dubious Causality
Let's start with a mystery: what explains the huge leap in Poland's reading scores on the Programme for International Student Assessment (PISA)? The test is given to 15-year-olds every three years. In 2000, Poland scored 480, then 497 in 2003, and 508 in 2006, a truly remarkable gain of 28 points in six years. To place these gains in perspective, the international average on the PISA was 492 in 2006. In only six years, Poland went from below average to above average, leapfrogging such countries as Japan, Germany, and the United Kingdom.

What explains Poland's great success? Almost immediately, explanations converged on one reform: tracking. In 1999, Poland changed its tracking system. In the old system, compulsory education (primary school) ended after eight years of schooling, at about age 15. Of those students matriculating to secondary schools, about half went to vocational schools focused on preparation for industrial sectors, one-third went to technical vocational schools, and about one-fifth attended academic schools (lyceum) that prepared students for college. After the reforms, primary education ended after six years of schooling, with the next three years devoted to a new compulsory, comprehensive lower secondary level (gymnasium). This postponed separation into vocational and academic schools and extended compulsory


education to about age 16, giving most 15-year-olds an extra year of academic instruction that they did not experience under the old system.

The Polish system, whether pre- or post-reform, should not be confused with tracking in the U.S. All tracking systems separate students for instruction, but tracking differs as much across countries as health care systems or the side of the street on which cars are driven. In the Polish system, tracking begins when students are divided into vocational or academic streams that attend separate schools. In the U.S., 15-year-olds are typically sophomores attending a comprehensive high school, one offering both vocational and academic courses. There are no formal vocational or academic streams. All sophomores must take several academic courses, with the specific requirements mandated by states. Within academic subjects, students may be tracked into classes distinguished by skill level (e.g., Honors English, English 10, or Remedial Reading) or curricular content (e.g., Geometry, Algebra, or General Math). These groupings are decided subject by subject. It is possible for students to be enrolled in an above-grade-level class in one subject and an at-grade or even below-grade-level class in another subject. In addition, students are not locked into a single track and may move up or down a level depending on their performance the previous year.40

The two key elements of the Polish tracking reform were the one-year delay in separating students by ability and the extra year of exposure to academic learning that vocational students now received. The two initiatives are independent. Vocational programs, if policy makers so decide, may include one or more years of intensive academic study. Many analysts decided that the one-year delay in tracking was the reason for Poland's jump in PISA scores. The first declaration came from the authors of the 2006 PISA score report, as evident in the 2006 PISA Executive Summary: "A long-term trend in OECD countries has been to reduce the amount of separation and tracking in secondary education. The most recent major example of this is Poland, whose reading results before and after this education reform are reported in PISA."41 The Executive Summary goes on to cast the reform as cost-free, producing gains without affecting other students: "Here [in Poland], an improvement in results among lower ability students immediately after the reform was not at the expense of higher ability students, whose results also rose in the subsequent period."42

Soon after, a World Bank study pressed harder on the theme of causality: "Poland's reading score was below the OECD average in 2000, at the OECD average in 2003, and above the OECD average in 2006, ranking 9th among all countries in the world. With regard to the factors responsible for the improvement, the delayed tracking into vocational streams appears to be the most critical factor."43 The study also mentioned the extra language and reading instruction that prospective vocational students were receiving as a contributing factor.

The hypothesis that delayed tracking led to gains is certainly reasonable. Hanushek and Woessmann show that nations with later tracking (ironically, the U.S. is coded as a late tracker in the study)44 have higher PISA scores than nations with earlier tracking. Also, as the World Bank study and OECD analyses indicated, improvement in Poland's scores was most pronounced at the bottom of the achievement distribution. Low achievers made larger gains than high achievers. Several studies of tracking, whether the American or European style, have found that tracking can depress the achievement of low achievers. In Poland, these are the students who would have been assigned to the vocational track. It makes sense that their achievement would rise after gaining an extra year of academic instruction.

So what's the mystery? Isn't the evidence persuasive that tracking reform was responsible for Poland's gains? Well, no, it's not. Delaying tracking by a year may have contributed to Poland's gains, but that simply cannot be concluded from PISA data. Hypothesized, yes, but concluded, no.

Don't forget that 2000 was the first year of PISA. We don't know Poland's trend before 2000. Poland's reading gains, and indeed the growth at the bottom of the distribution, may have started before the tracking reform in 1999. The 2000 score may have merely detected a trend already underway. Also, as noted above, it was the 2006 PISA scores that led to tracking reform being identified as influencing Poland's gains. Interestingly, Poland's reading score dipped 8 points in the very next release of scores, in 2009. None of the analysts who ascribed the 2000–2006 gain to the new tracking policy suspected that the 2006–2009 decline was related to the reform.

It turns out that by 2009 analysts could see that rising scores at the bottom of the achievement distribution were not only happening in Poland. As an OECD publication published after the 2009 scores explains, "In nearly all the countries that showed improved performance during the period [2000–2009], the percentage of low performers dropped, meaning that the number of students who scored below the PISA baseline reading proficiency Level 2 was significantly smaller in 2009 than in 2000."45

None of the other twelve nations in Table 3-1 implemented tracking reform. And all registered gains. Indeed, six nations managed larger reading gains than Poland! And six evidenced larger gains among students performing at the bottom of the distribution.

Tracking reform was not the key to these successes. And it may not be the key to Poland's.

Poland's 1999 education reforms were not limited to tracking. Instead, they involved a complete overhaul of the Polish school system, including such important elements as decentralization of authority and greater autonomy for schools, an increase in teacher salaries, a new system of national assessment, adoption of a core curriculum and national standards, reform of teacher education at the university level, and a new system of teacher promotion.46 Any one of these policies, or several in combination, may have produced Poland's gains on PISA. Some may have even produced negative effects, dragging down achievement, while others offset the losses with larger gains. The point is this: no single reform can be plucked from several reforms adopted simultaneously and declared to have had the greatest positive impact. Not based on PISA data. The data do not allow it.

Table 3-1. Nations that Gained in PISA Reading (2000–2009), Sorted by Change in Score

Nation          PISA Reading 2009   Change in Score (2000–2009)   Share of low performers   Share of top performers
Peru            370                 43                            14.8                      0.4
Chile           449                 40                            17.6                      0.8
Albania         385                 36                            13.7                      0.1
Indonesia       402                 31                            15.2                      0.0
Latvia          484                 26                            12.5                      1.2
Israel          474                 22                            6.7                       3.3
Poland          500                 21                            8.2                       1.3
Portugal        489                 19                            8.6                       0.6
Liechtenstein   499                 17                            6.4                       0.4
Brazil          412                 16                            6.2                       0.8
Korea           539                 15                            0                         7.2
Hungary         494                 14                            5.1                       1
Germany         497                 13                            4.2                       1.2

Source: PISA 2009 Results: Learning Trends: Changes in Student Performance Since 2000 (Volume V) (OECD, 2010).



Indeed, it is also possible that Poland's gains were not the result of policy choices at all. In a 2011 address on his nation's PISA gains, Miroslaw Sielatycki, Under-Secretary of State, Ministry of National Education in Poland, shared polling data indicating that the attitudes of the Polish people shifted dramatically around the time of the reforms. In 1993, less than half (42%) believed it was "definitely worth it" to get an education. In 2002, the percentage had jumped to 66%, and in 2009 it reached 68%.47

The public's changing view of the importance of education may have produced the gains on PISA. Or perhaps the 1999 policy changes produced a positive effect, but only conditional on shifting public attitudes, with tracking reform contributing only a tiny bit to this dynamic. No one knows for sure. To attribute Poland's 2000–2006 gains in PISA reading scores to tracking reform is a clear case of dubious causality, unsupported by the evidence.

The Problem With Rankings

Everyone loves rankings. International test results are frequently summarized by national rankings, introducing all kinds of mistakes into the interpretation of results.

When the 2009 PISA scores were released, for example, the Associated Press reported that "Out of 34 countries assessed, the U.S. ranked 14th in reading, 17th in science, and 25th in math."48 The rankings are correct, but actually 65 national and sub-national participants took the test. The 34 nations referred to in the Associated Press article are the economically developed nations belonging to the Organisation for Economic Co-operation and Development (OECD), omitting the dozens of developing countries that took the test and mostly scored lower than the U.S. The Huffington Post's headline, "U.S. Falls in World Education Rankings, Rated Average," misleads readers into thinking that American performance on PISA declined from the previous time American students took the test.49 In fact, in all three subjects, U.S. scores improved: from 495 to 500 in reading (the previous score is from 2003), from 474 to 487 in math, and from 489 to 502 in science. The "rated average" conclusion is accurate. But the U.S. has never been rated above average on PISA, and the 2009 scores show improvement for the U.S., not decline.

Beyond the misleading press accounts, focusing on rankings has several pitfalls. First, the confidence that two close rankings are truly different may not be established in terms of statistical significance. This is fundamentally important in interpreting international scores. Because national scores are derived from a randomly selected group of students, tests like PISA and TIMSS (and NAEP, for that matter) contain sampling error, noise that necessitates placing some wiggle room around each estimate.50 The word "error" is not pejorative in this usage but simply refers to a statistical property that must be considered in making an estimate from sampling. Fortunately, sampling error can be calculated. We refer to scores outside this wiggle room (based on what is known as the standard error) as statistically significantly different.
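A minimal sketch of that calculation, using hypothetical means and standard errors rather than actual PISA or TIMSS values: two national averages are declared statistically significantly different at the 95% level only when the gap between them exceeds 1.96 standard errors of the difference.

```python
# Minimal sketch, with hypothetical scores and standard errors (not
# actual PISA or TIMSS values): two national averages differ at the
# 95% level only if the gap between them exceeds 1.96 standard
# errors of the difference.
import math

def significantly_different(mean_a, se_a, mean_b, se_b, z_crit=1.96):
    """Compare two independent estimates, each with its own standard error."""
    se_diff = math.sqrt(se_a**2 + se_b**2)  # standard error of the difference
    return abs(mean_a - mean_b) > z_crit * se_diff

# A 6-point gap with standard errors near 3 points is within the wiggle room.
print(significantly_different(529, 3.1, 523, 3.4))  # False
# A 15-point gap with the same standard errors is outside it.
print(significantly_different(538, 3.1, 523, 3.4))  # True
```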

The authors of PISA and TIMSS go to great lengths preparing tables of results that incorporate sampling error into the estimates of national averages. Unfortunately, the tables are usually ignored. Let's examine a small portion of the relevant table from TIMSS 2007 4th Grade Math to see how rankings can mislead if they are interpreted incorrectly.

Figure 3-1 displays the relative rankings of the top fifteen nations in fourth-grade mathematics, rank-ordered by their average scale score (second column). Here is a basic guide to how to read the table. The buttons to the right of the average scale scores point up when a nation's average is statistically significantly higher than that of a comparison nation, point down when it is significantly lower, and remain neutral when the difference falls within sampling error.

[Figure 3-1: Relative rankings of the top fifteen nations, TIMSS 2007 fourth-grade mathematics, not reproduced here.]
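The logic behind those significance buttons can be sketched in a few lines of code. All of the nation names, means, and standard errors below are invented for illustration (they are not TIMSS 2007 figures); the rule applied is the conventional one of treating a gap as significant when it exceeds 1.96 standard errors of the difference.

```python
# Hedged sketch of a significance-button matrix, in the spirit of
# Figure 3-1. All means and standard errors are hypothetical.
import math

nations = [  # (name, mean scale score, standard error)
    ("Nation A", 568, 2.1),
    ("Nation B", 549, 2.7),
    ("Nation C", 544, 4.9),
]

def button(mean_1, se_1, mean_2, se_2):
    """'^' = first nation significantly higher, 'v' = significantly lower,
    'o' = difference within sampling error (95% level)."""
    if abs(mean_1 - mean_2) <= 1.96 * math.sqrt(se_1**2 + se_2**2):
        return "o"
    return "^" if mean_1 > mean_2 else "v"

for name_1, m1, s1 in nations:
    row = [button(m1, s1, m2, s2) for name_2, m2, s2 in nations if name_2 != name_1]
    print(f"{name_1}: {row}")
# Nation A: ['^', '^']   -- above both comparison nations
# Nation B: ['v', 'o']   -- below A, but not distinguishable from C
# Nation C: ['v', 'o']   -- ranked below B, yet the gap is within noise
```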

Look at Figure 3-1 again. If Kazakhstan's score of 549 were boosted 18 points, to 567, it would still rank the same, in 5th place behind Japan's 568. But an increase in Germany's score of 18 points, from 525 to 543, would elevate it from 12th to 7th place. These two hypothetical gains are identical in terms of scale score points, but they result in very different changes in rankings. Careless observers might be impressed by Germany's increase of 18 points but consider Kazakhstan's exact same improvement inconsequential, and they would be seriously misled as to how the two nations are performing. It cannot be stressed enough: interpret rankings cautiously.
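The arithmetic can be made concrete with a minimal sketch. The scores and nation names below are hypothetical, not TIMSS values; they are simply spaced widely at the top of the table and tightly in the middle, as real score distributions often are, so that identical 10-point gains move two nations' rankings very differently.

```python
# Hypothetical scores (not TIMSS values): wide gaps at the top,
# tight clustering in the middle of the table.
scores = {
    "Nation A": 605, "Nation B": 580, "Nation C": 560,
    "Nation D": 538, "Nation E": 535, "Nation F": 533,
    "Nation G": 531, "Nation H": 530,
}

def rank_of(nation, table):
    """Rank nations by score, 1 = highest; return one nation's rank."""
    ordered = sorted(table, key=table.get, reverse=True)
    return ordered.index(nation) + 1

for nation in ("Nation B", "Nation H"):
    bumped = dict(scores)
    bumped[nation] = scores[nation] + 10  # identical 10-point gain
    print(f"{nation}: rank {rank_of(nation, scores)} -> {rank_of(nation, bumped)}")
# Nation B: rank 2 -> 2   (the 25-point gap to Nation A is too wide to close)
# Nation H: rank 8 -> 4   (leapfrogs four tightly packed nations)
```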

The A+ Country Fallacy

The third misinterpretation can be dealt with quickly because it is related to the previous two. It is misleading, as shown with the case of Poland, to pull one policy from a country's entire reform agenda and proclaim that it alone produced a change in test scores. It is also misleading to focus on rankings. Add to those two errors the practice of focusing on a single policy from a single country at the top of the rankings, and a triple mistake has been committed. That's what some people do (mostly pundits, but, sadly, a few researchers who should know better) when they say "Country X is doing something I favor, and Country X scores at the top on TIMSS and PISA; therefore what I favor is a good idea."

So two maxims right up front. First, in order to determine whether a policy is good or bad, there must be variation in the policy across observable units. Put another way, if everyone is doing the same thing, you cannot tell whether that thing is good or bad. There must be variation so that the nations, states, districts, or schools embracing the policy can be compared to those without it. Second, the entire range of the distribution must be examined. Focusing on the top end of any distribution is bound to mislead, especially if causality is construed from mere observables (characteristics that can be seen and recorded). Low-performing and average-performing countries have as much to offer in providing knowledge about how policies work (or don't work) as do high-performing countries.

In the U.S., advocates of a national curriculum have for years pointed to nations at the top of TIMSS and PISA rankings and argued that because those countries have national curriculums, a national curriculum must be good. The argument is without merit. What the advocates neglect to observe is that countries at the bottom of the international rankings also have a national curriculum. In fact, almost all nations have national curriculums! And the most notable countries featuring federal systems and historically having no national curriculum (Australia, Germany, the United States) have been taking steps towards either adopting a national curriculum or reducing differences among state and provincial standards.

Another example of this mistake is the recent outbreak of Finland-worship in the United States. In the 1990s, it was Japan. In addition to dominating several industries, Japan scored at the top of international tests in the 1980s and 1990s. Americans rushed to copy various aspects of Japanese education and business (e.g., lesson study, the tenets of W. Edwards Deming). Now that PISA is the most prominent international test of high school students and Finland scores very high on all three subjects (reading, math, and science), it is Finland that many believe deserves emulation. Writing in The New Republic in 2011, Samuel E. Abrams cited small class sizes, highly paid teachers, providing ample play time to students, social promotion, detracking, and the spare use of standardized tests as policy positions on which the U.S. should follow Finland.51 Pasi Sahlberg argues that the American trend towards using value-added measures for evaluating teacher performance would be met with opposition in Finland:

It's very difficult to use this data to say anything about the effectiveness of teachers. If you tried to do this in my country, Finnish teachers would probably go on strike and wouldn't return until this crazy idea went away. Finns don't believe you can reliably measure the essence of learning.52

The irony, of course, is that Finland's exalted status in education circles comes from what Finns apparently do not believe in: measurements of learning. The PISA data do not clearly confirm or deny Sahlberg's claims about value-added evaluations of teachers, but that's not the real lesson here.53 The lesson is that the prudence of policies, if being evaluated based on international evidence, should never be judged by a single nation's reaction to them. Nor by the experience of a few nations. Case studies can be helpful in generating hypotheses, but not in testing for causal effects that can be generalized beyond the case-study nations' borders. To simply select one or two favorite policies out of everything a few nations do, even if those nations are high performing, and declare that other countries should follow their example is not very informative and certainly not good policy analysis.

Conclusion

International test scores are a valuable resource for policy makers and of great interest to the public. As this section of the Brown Center Report has illustrated, the results are also vulnerable to misinterpretation, especially when cited as evidence in political battles over the wisdom of adopting particular policies. Three misinterpretations are common. First, dubious claims of causality. Arguments have been made that Poland's tracking reforms spurred achievement gains on the PISA reading test from 2000 to 2009. It is plausible that tracking reform contributed to Poland's success, but the evidence is weak. Other countries accomplished gains just as large as Poland's without engaging in tracking reform. Many of them also boosted the scores of low achievers as much as Poland did. Moreover, Poland adopted several important reforms at the same time that tracking reform took place, and it is impossible to disentangle the effects of one reform from the others. Polish attitudes towards education shifted dramatically during this period and may have provided cultural support for achievement.

A second common mistake is the misuse of national rankings. The test scores underlying two adjacent rankings, or even several close rankings, may not be statistically significantly different. Rankings are not equal interval (they differ in various parts of the distribution), so a nation may jump several rankings with a gain that is actually smaller than that of a country whose ranking stays the same. Rankings must be interpreted with great caution.

Finally, the A+ country fallacy is a common mistake. Pointing to a single, high-scoring country, or a group of them, and declaring that one or more of their policies should be adopted by other countries is misguided. It combines the errors of making dubious causal claims and misusing rankings, and by ignoring evidence from low or average scoring nations on the same policy question, produces a triple error, a true whopper in misinterpreting international test scores.


    NOTES

1. See the Common Core State Standards Initiative website on "About the Standards," http://www.corestandards.org/about-the-standards

2. Jason L. Riley, "Was the $5 Billion Worth It?" Wall Street Journal, July 23, 2011.

3. John Cronin, Michael Dahlin, Deborah Adkins, and G. Gage Kingsbury, The Proficiency Illusion (Washington, DC: The Thomas B. Fordham Institute, 2007).

4. See Jennifer Medina, "Standards Raised, More Students Fail Tests," New York Times, July 29, 2010, p. A1; Paul E. Peterson and Frederick Hess, "Keeping an Eye on State Standards," Education Next, Vol. 6, No. 3, Summer 2006; Michigan Department of Education, "State Board Gives Nod to Improve Standards for State Assessment Scores," September 13, 2011, http://www.michigan.gov/mde/0,4615,7140--262249--,00.html.

5. Sheila Byrd Carmichael, Gabrielle Martino, Kathleen Porter-Magee, and W. Stephen Wilson, The State of State Standards – and the Common Core – in 2010 (Washington, DC: The Thomas B. Fordham Institute, 2010).

6. Paul E. Peterson and Frederick Hess, "Keeping an Eye on State Standards," Education Next, Vol. 6, No. 3, Summer 2006.

7. Tom Loveless, "Are States Honestly Reporting Test Scores?" The 2006 Brown Center Report on American Education: How Well Are American Students Learning? (Washington, DC: The Brookings Institution, 2006), pp. 21–29.

8. Jason L. Riley, "Was the $5 Billion Worth It?" Wall Street Journal, July 23, 2011.

9. U.S. Department of Education, "Chapter 8: Instructional Materials," The Final Report of the National Mathematics Advisory Panel 2008 (Washington, DC: United States Department of Education, 2008).

10. Andrew Porter, "In Common Core, Little to Cheer About," Education Week, August 10, 2011, pp. 24–25.

11. R. James Milgram and Sandra Stotsky, Fair to Middling: A National Standards Progress Report (Boston, MA: Pioneer Institute, 2010); Sandra Stotsky and Zeev Wurman, Common Core's Standards Still Don't Make the Grade: Why Massachusetts and California Must Regain Control Over Their Academic Destinies (Boston, MA: Pioneer Institute, 2010); Zeev Wurman and Bill Evers, "National Standards Would Harm Math Curriculum," San Francisco Chronicle, July 30, 2010, p. A14.

12. "Closing the Door on Innovation," http://www.k12innovation.com/Manifesto/_V2_Home.html

13. Russ Whitehurst, "Don't Forget the Curriculum," Brown Center Letters on Education #3 (Washington, DC: Brookings Institution, October 2009), http://www.brookings.edu/papers/2009/1014_curriculum_whitehurst.aspx

14. National Center for Education Statistics, Mapping 2005 State Proficiency Standards Onto the NAEP Scales (NCES 2007-482) (Washington, DC: U.S. Department of Education, 2007).

15. V. Bandeira de Mello, Mapping State Proficiency Standards Onto the NAEP Scales: Variation and Change in State Standards for Reading and Mathematics, 2005–2009 (NCES 2011-458) (Washington, DC: U.S. Department of Education, 2011).

16. Curtis C. McKnight, "Model for the Second International Mathematics Study," SIMS Bulletin, 4 (1979), pp. 6–39.

17. A recent study of math topics taught by 7,000 teachers in 27 states shows a huge redundancy in topics. More than 70% of the math topics taught in K–8 are repeated from the previous grade. State standards are just as repetitive, but in support of the current study's findings, there is no correlation between the redundancy of state standards and the redundancy of instruction (Morgan Polikoff, in press, "The Redundancy of Mathematics Instruction in U.S. Elementary and Middle Schools," Elementary School Journal).

18. See The Great Curriculum Debate: How Should We Teach Reading and Math? edited by Tom Loveless (Washington, DC: The Brookings Institution, 2001).

19. James S. Coleman et al., Equality of Educational Opportunity (Washington, DC: U.S. Dept. of Health, Education, and Welfare, Office of Education, 1966); Christopher Jencks et al., Inequality: A Reassessment of the Effect of Family and Schooling in America (New York: Basic Books, Inc., 1972).

20. For contrasting positions on the inheritability of intelligence, see Richard J. Herrnstein and Charles Murray, The Bell Curve: Intelligence and Class Structure in American Life (New York: Free Press, 1994) and Intelligence, Genes, and Success: Scientists Respond to The Bell Curve, edited by Bernie Devlin, Stephen E. Fienberg, Daniel P. Resnick, and Kathryn Roeder (New York: Springer-Verlag, 1997).

21. 495 F. Supp. 926 (N.D. Cal. 1979).

22. Christopher Jencks, "Racial Bias in Testing," in The Black–White Test Score Gap, edited by Christopher Jencks and Meredith Phillips (Washington, DC: The Brookings Institution, 1998), pp. 55–85.

23. Walt Gardner, "The SAT-ACT Duel," Education Week, June 25, 2010.

24. Neil J. Dorans, C. F. Lyu, M. Pommerich, and W. M. Houston, "Concordance Between ACT Assessment and Recentered SAT I Sum Scores," College and University, 73 (1997), pp. 24–34.

25. Roy Freedle, "Correcting the SAT's Ethnic and Social-Class Bias: A Method for Reestimating SAT Scores," Harvard Educational Review, 73 (2003), pp. 1–43.

26. Jay Mathews, "The Bias Questions," The Atlantic, 292 (2003), pp. 130–140; Maria Veronica Santelices and Mark Wilson, "Unfair Treatment? The Case of Freedle, the SAT, and the Standardization Approach to Differential Item Functioning," Harvard Educational Review, 80 (2010), pp. 106–134.

27. Michael Kirst and Henry Rowan, "Why We Should Replace Aptitude Tests with Achievement Tests," Education Week, September 8, 1993, citing John Bishop, "Incentives for Learning: Why American High School Students Compare So Poorly to their Counterparts Overseas," Research in Labor Economics, Vol. 11 (1990), pp. 17–52.

28. http://nces.ed.gov/nationsreportcard/about/ltt_main_diff.asp

29. Pamela M. Jakwerth, Frances B. Stancavage, and Ellen D. Reed, NAEP Validity Studies: An Investigation of Why Students Do Not Respond to Questions (Washington, DC: American Institutes for Research, 1999).

30. For example, see The Black–White Test Score Gap, edited by Christopher Jencks and Meredith Phillips (Washington, DC: The Brookings Institution, 1998); Bridging the Achievement Gap, edited by John E. Chubb and Tom Loveless (Washington, DC: The Brookings Institution, 2002); Abigail Thernstrom and Stephan Thernstrom, No Excuses: Closing the Racial Gap in Learning (New York, NY: Simon & Schuster, 2003); and Unfinished Business: Closing the Racial Achievement Gap in Our Schools, edited by Pedro A. Noguera and Jean Yonemura Wing (San Francisco, CA: Jossey-Bass, 2006).

31. Alan Vanneman, Linda Hamilton, Janet Baldwin Anderson, and Taslima Rahman, Achievement Gaps: How Black and White Students Perform in Mathematics and Reading on the National Assessment of Educational Progress (Washington, DC: NCES, 2009).

32. See Jaekyung Lee, "Racial and Ethnic Achievement Gap Trends: Reversing the Progress Toward Equity?" Educational Researcher, 31, pp. 3–12; Larry V. Hedges and Amy Nowell, "Test Score Convergence Since 1965," The Black–White Test Score Gap, edited by Christopher Jencks and Meredith Phillips (Washington, DC: The Brookings Institution, 1998), pp. 149–181.

33. See Anna Habash Rowan, Daria Hall, and Kati Haycock, Gauging the Gaps: A Deeper Look at Student Achievement (Washington, DC: The Education Trust, 2010) and state profiles prepared by the Center on Education Policy (www.cep-dc.org).

34. The LTT does have five performance levels, anchored at 50-point intervals on the LTT scale, but they are rarely referred to by analysts. The Main NAEP achievement levels are set at points describing what students know and are able to do, differ by each grade and subject, and employ categories (e.g., proficient) that are also used by state tests.

35. David T. Burkam and Valerie E. Lee, Inequality at the Starting Gate (Washington, DC: Economic Policy Institute, 2002).

36. http://nces.ed.gov/ecls/kinderassessments.asp

37. David Baker and Gerald K. LeTendre, National Differences, Global Similarities: World Culture and the Future of Schooling (Stanford, CA: Stanford University Press, 2005).

38. Application, analysis, synthesis, and evaluation are the four highest levels of Bloom's Taxonomy of Educational Objectives. Recall and comprehension are the two lowest levels and usually considered to be apt descriptors of basic skills. Benjamin S. Bloom, Taxonomy of Educational Objectives (Boston, MA: Allyn and Bacon, 1956).

39. This section is based on an address given to the 52nd IEA General Assembly in Dublin, Ireland (October 10–13, 2010). Appreciation is extended to Jan-Eric Gustafsson, co-presenter and helpful collaborator on the project. For a history of international testing by IEA, see Ina V.S. Mullis and Michael O. Martin, "TIMSS in Perspective," Lessons Learned, edited by Tom Loveless (Washington, DC: The Brookings Institution, 2007), pp. 9–36.

40. Tom Loveless, The Tracking Wars (Washington, DC: The Brookings Institution, 1999).

41. Executive Summary, PISA 2006: Science Competencies for Tomorrow's World (OECD, 2007), p. 39.

42. Executive Summary, PISA 2006: Science Competencies for Tomorrow's World (OECD, 2007), p. 39.

43. The World Bank, "Successful Education Reform: Lessons from Poland," Europe and Central Asia Knowledge Brief, 34 (2010), p. 3.

44. Eric A. Hanushek and Ludger Woessmann, "Does Educational Tracking Affect Performance and Inequality? Differences-in-Differences Evidence Across Countries," Economic Journal, Royal Economic Society, 116 (2006), pp. C63–C76. The study's coding of the U.S. as a late-tracking nation (after age 15) does not make sense for two reasons. Based on the European-style tracking that the study uses to define tracking (separating students by ability into separate schools), the U.S. basically never tracks, except for a few elite exam schools in a small number of big cities. Moreover, coding the U.S. as a non-tracking nation before age 15 obscures the huge amount of tracking that goes on in middle school mathematics courses and the common practice of within-class ability grouping beginning with first grade reading instruction. Bottom line: the U.S. tracks and ability groups students a lot before age 15 but almost never uses the between-school model of separation and differentiation.

45. OECD, "Improving Performance: Leading from the Bottom," PISA in Focus, 2 (2011), p. 1.

46. Miroslaw Sielatycki, Under-Secretary of State, Ministry of National Education, "Poland: Successes and Challenges of Educational Reform," 14th OECD Japan Seminar, Tokyo, Japan, June 28–29, 2011.

47. Miroslaw Sielatycki, Under-Secretary of State, Ministry of National Education, "Poland: Successes and Challenges of Educational Reform," 14th OECD Japan Seminar, Tokyo, Japan, June 28–29, 2011.

48. Christine Armario, "'Wake-Up Call': U.S. Students Trail Global Leaders," Associated Press Wire, December 7, 2010.

49. "The U.S. Falls in World Education Rankings, Rated Average," The Huffington Post, December 7, 2010.

50. Means reported from tests given to census populations (in other words, to all students) do not have sampling error. They are simply the average of all test scores. State tests given to all students in a particular grade or for high school graduation are examples of census populations. This does not automatically make them better tests because tests may possess other types of error. The proper term for the wiggle room of an estimate is "confidence interval."

51. Samuel E. Abrams, "The Children Must Play: What the United States Could Learn from Finland About Education Reform," The New Republic, January 28, 2011.

52. Hechinger Report, "What Can We Learn from Finland?" Hechinger Report, December 9, 2010, available at hechingerreport.org.

53. PISA 2006: Science Competencies for Tomorrow's World (OECD, 2006), pp. 240–244, 276–277. PISA 2009 Results: What Students Know and Can Do (OECD, 2009), pp. 46–47.


THE BROOKINGS INSTITUTION

STROBE TALBOTT
President

DARRELL WEST
Vice President and Director, Governance Studies Program

BROWN CENTER STAFF

GROVER "RUSS" WHITEHURST
Senior Fellow and Director

TOM LOVELESS
Senior Fellow

MATTHEW CHINGOS
Fellow

MICHAEL GALLAHER
Research Analyst

SARAH WHITFIELD
Staff Assistant

PAUL T. HILL
Non-resident Senior Fellow

DIANE RAVITCH
Non-resident Senior Fellow

Views expressed in this report are solely those of the author.


BROOKINGS
1775 Massachusetts Avenue, NW, Washington, D.C. 20036
Tel: 202-797-6000  Fax: 202-797-6004
www.brookings.edu