
Using Classroom Artifacts to Measure Instructional Practice in Middle School Mathematics: A Two-State Field Test

Hilda Borko, Suzanne Arnold, Beth Dorman, Karin Kuffner (CU-Boulder)

Brian Stecher, Mary Lou Gilbert, Alice Wood (RAND Corporation)

CRESST Conference 2004

September 10, 2004

Artifact Packages for Characterizing Instructional Practice: A Validation Study

Goal: an instrument to capture instructional practice reliably and efficiently

Rationale for artifact packages

Richer descriptions than surveys

Fewer resource demands than case studies

Validation study to investigate reliability and validity

The Scoop Metaphor

“What is it like to learn mathematics in your classroom?”

A Scoop of Classroom Material

One way that scientists study unfamiliar territory (e.g., freshwater wetlands, Earth’s crust) is to scoop up all the material they find in one place and take it to the laboratory for careful examination. Analysis of a typical Scoop of material can tell a great deal about the area from which it was taken.

We would like to do something similar in classrooms, i.e., scoop up a typical week’s worth of material and use it to learn about the class from which it was taken. The artifacts would include assignments, homework, tests, projects, problem solving activities, and anything else that is part of instruction during the week.

The Scoop Notebook

“Scoop” a typical week’s worth of instructional materials

Variety of methods for capturing instructional practice

Daily calendar

Instructional materials

Samples of student work

Photographs

Teacher Reflections

Methods

Participants

36 middle school mathematics teachers

Teachers from Colorado (23) and California (13)

Variety of curricula, traditional to reform

Data from 30 teachers used in reliability and validity analyses

Data collection

Scoop Notebook completed by teacher (5 days of instruction)

Researcher observation and ratings (2 - 3 days)

Audiotape of instruction (8 teachers, 2 - 3 days)

Scoring Guide

11 Dimensions of Classroom Practice

Collaborative Grouping

Structure of Lessons

Multiple Representations

Use of Mathematical Tools

Cognitive Depth

Discourse Community

Explanation & Justification

Problem Solving

Assessment

Connections & Applications

Overall

(Notebook Completeness)

(Confidence)

Rating Observations and Notebooks

Five-point rating scale

Scoring Guide with descriptions and examples for each dimension:

high (5)

medium (3)

low (1)

Scoring Guide Example: Problem Solving

Overall Description: Extent to which instructional activities enable students to identify, apply and adapt a variety of strategies to solve problems. Extent to which problems that students solve are complex and allow for multiple solutions. [NOTE: this dimension focuses more on the nature of the activity/task than the enactment. To receive a high rating, problems should not be routine or algorithmic; they should consistently require novel, challenging, and/or creative thinking.]

High: Students work on problems that are complex, integrate a variety of mathematical topics, and draw upon previously learned skills. Problems lend themselves to multiple solution strategies and have multiple possible solutions. Problem solving is an integral part of the class’ mathematical activity, and students are regularly asked to formulate problems as well as solve them.

Example: During a unit on measurement, students regularly solve problems such as: “Estimate the length of your family’s car. If you lined this car up bumper to bumper with other cars of the same size, about how many car lengths would equal the length of a blue whale?” After solving the problem on their own, students compare their solutions and discuss their solution strategies. The teacher reinforces the idea that there are many different strategies for solving the problem and a variety of answers because the students used different estimates of car length to solve the problem.

Ratings of Instructional Practice

Notebook Only: Contents of Scoop Notebook

Gold Standard: Observations and contents of Scoop Notebook

Notebook + Discourse: Transcripts of audio-taped classroom lessons and contents of Scoop Notebook

Reliability Research Questions

Do raters agree on the scores they assign to the dimensions of classroom practice, based on the Scoop Notebook?

Is agreement among raters higher for some dimensions than others?

Is agreement among raters higher for some teachers than others?

Agreement Among Raters: Calculation Procedures

Three raters per notebook; pairs of ratings compared

1-2-3: three pairs (1,2), (1,3), & (2,3)

Exact agreement = 0%

Within 1 rating point = 67%

4-4-1: three pairs (4,4), (4,1), (4,1)

Exact agreement = 33%

Within 1 rating point = 33%
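The pairwise calculation above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's actual analysis code; the function name `pairwise_agreement` and its `tolerance` parameter are assumptions introduced here. Run on the two rating triples from the slide, it reproduces the percentages shown.

```python
from itertools import combinations


def pairwise_agreement(ratings, tolerance=0):
    """Fraction of rater pairs whose ratings differ by at most `tolerance`.

    `ratings` holds one rating per rater (three raters per notebook here).
    tolerance=0 gives exact agreement; tolerance=1 gives agreement within
    1 rating point.
    """
    pairs = list(combinations(ratings, 2))
    agreeing = sum(1 for a, b in pairs if abs(a - b) <= tolerance)
    return agreeing / len(pairs)


# The two examples from the slide: ratings of 1-2-3 and 4-4-1.
for ratings in [(1, 2, 3), (4, 4, 1)]:
    exact = pairwise_agreement(ratings, tolerance=0)
    within_one = pairwise_agreement(ratings, tolerance=1)
    print(ratings, f"exact={exact:.0%}", f"within 1={within_one:.0%}")
```

Percentages for a whole dimension or teacher could presumably be obtained by pooling the pairs across notebooks before dividing, but the slides do not spell out that aggregation step.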

Agreement by Dimension

Average ratings across teachers close to 3.0 for all dimensions

Relatively high levels of agreement for all dimensions

Exact agreement ranged from 21.1% to 44.3%

Agreement within 1 point ranged from 70.1% to 82.3%

Agreement fairly consistent across dimensions

Agreement by Teacher

Wide range of values

Average notebook ratings (1.55 to 4.21)

Exact agreement: 12.0% to 60.5%

Agreement within 1: 57.5% to 97.0%

No apparent relationship to:

Average notebook rating (traditional versus reform practices)

Notebook completeness

Rater confidence

Validity Research Questions

1. Do ratings based only on the Scoop Notebook agree with ratings based on the Scoop Notebook and classroom observations (“Gold Standard” ratings)? Is agreement higher for some dimensions than others? Is agreement higher for some teachers than others?

2. Are there differences in the ratings of Colorado teachers and California teachers?

3. Do ratings based on the Scoop Notebook and transcripts of classroom lessons agree with Gold Standard ratings?

Methods Similar to the Reliability Analysis

Comparisons between average Notebook Only rating (averaged across 3 raters) and Gold Standard rating

Two levels of agreement (on 5-point scale)

Within 0.33

Within 0.67
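A minimal sketch of this comparison follows, under two assumptions that the slides do not state: that the Gold Standard rating is a single whole-number rating per dimension, and that the 0.33 and 0.67 thresholds are rounded forms of 1/3 and 2/3 (the gaps that arise when three whole-number ratings are averaged). The function names and the sample ratings are hypothetical and are not study data.

```python
from fractions import Fraction


def notebook_vs_gold(notebook_ratings, gold_rating):
    """Absolute gap between the mean Notebook Only rating and the Gold Standard rating."""
    mean = Fraction(sum(notebook_ratings), len(notebook_ratings))
    return abs(mean - gold_rating)


def agreement_rate(cases, threshold):
    """Share of (notebook_ratings, gold_rating) cases whose gap is at most `threshold`."""
    hits = sum(1 for nb, gold in cases if notebook_vs_gold(nb, gold) <= threshold)
    return hits / len(cases)


# Hypothetical ratings, invented for illustration only (not study data).
cases = [
    ((3, 3, 4), 3),  # mean 10/3, gap 1/3
    ((2, 3, 4), 4),  # mean 3, gap 1
    ((5, 4, 4), 5),  # mean 13/3, gap 2/3
    ((1, 1, 1), 1),  # mean 1, gap 0
]

# Assumption: "within 0.33" and "within 0.67" mean within 1/3 and 2/3.
ONE_THIRD = Fraction(1, 3)
TWO_THIRDS = Fraction(2, 3)

print(agreement_rate(cases, ONE_THIRD))   # 2 of 4 cases
print(agreement_rate(cases, TWO_THIRDS))  # 3 of 4 cases
```

Exact fractions are used instead of floats so that a gap of exactly one third is not excluded by rounding error.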

Agreement by Dimension

Moderately high levels of agreement for all dimensions

Agreement within 0.33 ranged from 30.0% to 53.3% across the 11 dimensions

Agreement within 0.67 ranged from 43.3% to 66.7%

Differences in agreement among dimensions make sense

Structure of Lessons “easy” to rate

Mathematical Discourse and Assessment more “difficult” to rate

Agreement by Teacher

Pattern similar to reliability data

Large differences among teachers in levels of agreement



Level of agreement is not related to:

Average notebook rating

Notebook completeness

Rater confidence

Notebooks Detect Known Differences in Curriculum

Average ratings differed for teachers using traditional vs. reform-based curricula

Notebook ratings: 3.42 vs. 2.59

Gold standard ratings: 3.47 vs. 2.30

Differences between ratings varied by dimension and match known differences in the curricula

Ratings most alike on Structure of Lessons and Assessments

Ratings most different on Cognitive Depth, Discourse Community, etc.

Validity Analyses with Classroom Transcripts

How do the ratings based on the Scoop Notebook and transcripts of classroom lessons compare to Gold Standard ratings?

To what extent does analysis of classroom discourse provide additional insights about instructional practices?

Discourse Plus Scoop Notebook vs. Gold Standard

Exact agreement occurred in 45.4% of cases

Range across dimensions:

Grouping: 14.3%

Structure of Lessons: 71.4%

Agreement within 1.0 point occurred in 92.2% of cases.

Agreement within 1 was 100% for 7 of 11 dimensions

In general, relatively high levels of agreement

Qualitative Analysis

On which dimensions does discourse provide more information and insights than the Scoop Notebook alone?

Mathematical Discourse Community

Explanation/Justification

Cognitive Depth

Connections/Applications

Assessment

Additional Insights: Mathematical Discourse Community

How teacher solicits, explores, & attends to student thinking

How teacher models & emphasizes use of mathematical language

Student-to-student communication

Common classroom discourse patterns (e.g., IRE; more open ended)

Conclusions: Feasibility of the Approach

Teachers were interested, supportive, and cooperative

Teachers were able to follow artifact collection instructions well

Notebooks returned in timely manner

Student work represented a broad range of curriculum and instructional activities

Photographs and reflections were descriptive

Conclusions: Reliability and Validity

Agreement among raters is reasonably high for all dimensions and very high for some

Agreement between Notebook Only ratings and Gold Standard ratings is moderately high for all dimensions

Some dimensions and teaching practices present greater challenges than others for artifact-based tools such as the Scoop Notebook

Raters reported struggling with some dimensions (e.g., Mathematical Discourse Community) more than others

Information about classroom discourse provides additional insights about some dimensions

Disagreements among raters may be greater when there are inconsistencies in the data

Implications and Future Directions

Scoop Notebook is useful for describing instructional practice in broad terms

Results do not support use of the Scoop Notebook to make judgments about individual teachers

Additional research needed to answer questions such as:

Why are some classrooms and teachers more difficult to rate than others?

Are there systematic differences among individual raters?

Possible future uses of the Scoop Notebook

Tool for professional development

Trace changes in teachers over time or across different instructional units
