
How Stable Are Value-Added Estimates?


Highlights:

  • A teacher’s value-added score in one year is partially but not fully predictive of her performance in the next.
  • Value-added is unstable because true teacher performance varies and because value-added measures are subject to error.
  • Two years of data do a meaningfully better job of predicting value added than just one.
  • A teacher’s value added in one subject is only partially predictive of her value added in another, and her value added for one group of students is only partially predictive of her value added for others.
  • The variation of a teacher’s value added across time, subject, and student population depends in part on the model with which it is measured and the source of the data that is used.
  • Year-to-year instability suggests caution when using value-added measures to make decisions for which there are no mechanisms for re-evaluation and no other sources of information.

Introduction

Value-added models measure teacher performance by the test score gains of their students, adjusted for a variety of factors such as the performance of students when they enter the class. The measures are based on desired student outcomes such as math and reading scores, but they have a number of potential drawbacks. One of them is the inconsistency in estimates for the same teacher when value added is measured in a different year, for different subjects, or for different groups of students.
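
As an illustration of the mechanics (not the specific model any district or the brief uses), a simple value-added estimate is a classroom's average residual from a regression of end-of-year scores on beginning-of-year scores. All of the data, effect sizes, and variable names below are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: prior-year and current-year scores for two classrooms.
n = 100
prior = rng.normal(50, 10, size=2 * n)          # incoming achievement
teacher = np.repeat([0, 1], n)                   # which teacher taught each student
true_effect = np.where(teacher == 1, 3.0, 0.0)   # assumed true teacher effects
current = 5 + 0.9 * prior + true_effect + rng.normal(0, 5, size=2 * n)

# Fit current ~ prior by ordinary least squares, i.e. adjust for the
# performance of students when they enter the class.
X = np.column_stack([np.ones_like(prior), prior])
beta, *_ = np.linalg.lstsq(X, current, rcond=None)
residuals = current - X @ beta

# A teacher's value-added estimate is her classroom's mean residual.
va = {t: residuals[teacher == t].mean() for t in (0, 1)}
print(va)  # teacher 1's estimate should exceed teacher 0's
```

Note that in this toy setup each estimate is relative to the sample-average teacher, which is why one score comes out positive and the other negative; real models add many more adjustments.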

Some of the differences in value added from year to year result from true differences in a teacher’s performance. Differences can also arise from classroom peer effects; the students themselves contribute to the quality of classroom life, and this contribution changes from year to year. Other differences come from the tests on which the value-added measures are based; because test scores are not perfectly accurate measures of student knowledge, it follows that they are not perfectly accurate gauges of teacher performance.
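
The role of measurement error can be shown with a toy simulation (every parameter here is invented): even when a teacher's true effect is perfectly stable, noisy yearly estimates correlate imperfectly across years, and averaging two years of estimates predicts a later year better than a single year does.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical: 5,000 teachers with a fixed true effect (sd = 1), measured
# each year with independent estimation error (sd = 1).
n_teachers = 5000
true_effect = rng.normal(0, 1, n_teachers)
year1 = true_effect + rng.normal(0, 1, n_teachers)
year2 = true_effect + rng.normal(0, 1, n_teachers)
year3 = true_effect + rng.normal(0, 1, n_teachers)

# Year-to-year correlation is attenuated by error (toward 0.5 here, not 1.0),
# even though the underlying teacher effect never changes.
r_single = np.corrcoef(year1, year2)[0, 1]

# Averaging two years of estimates predicts a third year better than one does.
two_year_avg = (year1 + year2) / 2
r_avg = np.corrcoef(two_year_avg, year3)[0, 1]

print(r_single, r_avg)  # r_avg should exceed r_single
```

Real value-added estimates also vary because true performance changes from year to year; this sketch isolates only the error component.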

In this brief, we describe how value-added measures for individual teachers vary across time, subject, and student populations. We discuss how additional research could help educators use these measures more effectively, and we pose new questions, the answers to which depend not on empirical investigation but on human judgment. Finally, we consider how the current body of knowledge, and the gaps in that knowledge, can guide decisions about how to use value-added measures in evaluations of teacher effectiveness.

[readon2 url="http://www.carnegieknowledgenetwork.org/briefs/value-added/value-added-stability/"]Continue reading...[/readon2]

Like an untested drug?

If a new drug had shown some promise in curing the flu in lab trials, but there were also indications that it had some nasty, in some cases fatal, side effects, would you think that drug required more testing and trials, or that it should be rushed into production and given out as widely as possible?

That's basically the scenario we have with using value-added scores for high-stakes decision making when it comes to teachers. Sure, no one is actually going to die, but if corporate education reformers have their way, many teachers might wrongly lose their jobs, the money wasted will never be used to actually educate a student, and we will pay the opportunity cost of failing to get effective reforms into the classroom.

Given the context-dependency of the estimators’ ability to produce accurate results, however, and our current lack of knowledge regarding prevailing assignment practices, VAM-based measures of teacher performance, as currently applied in practice and research, must be subjected to close scrutiny regarding the methods used and interpreted with a high degree of caution.

Methods of constructing estimates of teacher effects that we can trust for high-stakes evaluative purposes must be further studied, and there is much left to investigate. In future research, we will explore the extent to which various estimation methods, including more sophisticated dynamic treatment effects estimators, can handle further complexity in the data-generating processes (DGPs).

The addition of test measurement error, school effects, time-varying teacher effects, and different types of interactions among teachers and students are a few of many possible dimensions of complexity that must be studied. Finally, diagnostics are needed to identify the structure of decay and prevailing teacher assignment mechanisms. If contextual norms with regard to grouping and assignment mechanisms can be deduced from available data, then it may be possible to determine which estimators should be applied in a given context.

We must be able to prove that evaluations, and the metrics that make them up, are fair, accurate, and stable; and if they are to have any real benefit, they must ultimately demonstrate a cost-effective way to improve student achievement and education quality. We're simply not there yet, and pretending we are is dangerous and carries some very real risks.