
Value-Added Versus Observations

Value-Added Versus Observations, Part One: Reliability

Although most new teacher evaluations are still in various phases of pre-implementation, it's safe to say that classroom observations and/or value-added (VA) scores will be the most heavily weighted components of teachers' final scores, depending on whether teachers are in tested grades and subjects. One gets the general sense that many – perhaps most – teachers strongly prefer the former (observations, especially peer observations) over the latter (VA).

One of the most common arguments against VA is that the scores are error-prone and unstable over time – i.e., that they are unreliable. And it’s true that the scores fluctuate between years (also see here), with much of this instability due to measurement error, rather than “real” performance changes. On a related note, different model specifications and different tests can yield very different results for the same teacher/class.
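
To see why measurement error alone produces this kind of year-to-year instability, here is a minimal simulation sketch: each teacher gets a stable "true" effect, each year's estimate adds independent noise, and the correlation between two years of estimates ends up well below 1 even though no teacher's real performance changed. The variance numbers are illustrative assumptions, not figures from the studies linked above.

```python
import numpy as np

rng = np.random.default_rng(0)

n_teachers = 10_000
true_sd = 1.0    # spread of stable "true" teacher effects (illustrative)
noise_sd = 1.5   # single-year estimation error (illustrative)

true_effect = rng.normal(0, true_sd, n_teachers)

# Two years of estimates: identical true effects, independent measurement error.
year1 = true_effect + rng.normal(0, noise_sd, n_teachers)
year2 = true_effect + rng.normal(0, noise_sd, n_teachers)

reliability = true_sd**2 / (true_sd**2 + noise_sd**2)
observed_r = np.corrcoef(year1, year2)[0, 1]

print(f"theoretical reliability:  {reliability:.2f}")  # about 0.31
print(f"year-to-year correlation: {observed_r:.2f}")   # about 0.31, with zero real change
```

The point is not the particular numbers, but that instability by itself cannot tell us how much of the fluctuation is noise and how much is real change – which is why the same caution applies to any error-prone measure, observations included.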

These findings are very important, and often too casually dismissed by VA supporters, but the issue of reliability is, to varying degrees, endemic to all performance measurement. Actually, many of the standard reliability-based criticisms of value-added could also be leveled against observations. Since we cannot observe “true” teacher performance, it’s tough to say which is “better” or “worse,” despite the certainty with which both “sides” often present their respective cases. And, the fact that both entail some level of measurement error doesn’t by itself speak to whether they should be part of evaluations.*

Nevertheless, many states and districts have already made the choice to use both measures, and in these places, the existence of imprecision is less important than how to deal with it. Viewed from this perspective, VA and observations are in many respects more alike than different.

[readon2 url="http://shankerblog.org/?p=5621"]Continue reading part I[/readon2]

Value-Added Versus Observations, Part Two: Validity

In a previous post, I compared value-added (VA) and classroom observations in terms of reliability – the degree to which they are free of error and stable over repeated measurements. But even the most reliable measures aren’t useful unless they are valid – that is, unless they’re measuring what we want them to measure.

Arguments over the validity of teacher performance measures, especially value-added, dominate our discourse on evaluations. There are, in my view, three interrelated issues to keep in mind when discussing the validity of VA and observations. The first is definitional – in a research context, validity is less about a measure itself than the inferences one draws from it. The second point might follow from the first: The validity of VA and observations should be assessed in the context of how they’re being used.

Third and finally, given the difficulties in determining whether either measure is valid in and of itself, as well as the fact that so many states and districts are already moving ahead with new systems, the best approach at this point may be to judge validity in terms of whether the evaluations are improving outcomes. And, unfortunately, there is little indication that this is happening in most places.

Let’s start by quickly defining what is usually meant by validity. Put simply, whereas reliability is about the precision of the answers, validity addresses whether we’re using them to answer the correct questions. For example, a person’s weight is a reliable measure, but this doesn’t necessarily mean it’s valid for gauging the risk of heart disease. Similarly, in the context of VA and observations, the question is: Are these indicators, even if they can be precisely estimated (i.e., they are reliable), measuring teacher performance in a manner that is meaningful for student learning?

[readon2 url="http://shankerblog.org/?p=5670"]Continue reading part II[/readon2]

Shame, errors and demoralizing

Shame, errors, and demoralization: just some of the rhetoric that has emerged since the NYT and other publications went ahead and published teacher-level value-added scores. A great number of articles have been written decrying the move.

Perhaps most surprising of all was Bill Gates, who weighed in with a piece titled "Shame Is Not the Solution". In it, Gates argues:

Value-added ratings are one important piece of a complete personnel system. But student test scores alone aren’t a sensitive enough measure to gauge effective teaching, nor are they diagnostic enough to identify areas of improvement. Teaching is multifaceted, complex work. A reliable evaluation system must incorporate other measures of effectiveness, like students’ feedback about their teachers and classroom observations by highly trained peer evaluators and principals.

Putting sophisticated personnel systems in place is going to take a serious commitment. Those who believe we can do it on the cheap — by doing things like making individual teachers’ performance reports public — are underestimating the level of resources needed to spur real improvement.
[...]
Developing a systematic way to help teachers get better is the most powerful idea in education today. The surest way to weaken it is to twist it into a capricious exercise in public shaming. Let’s focus on creating a personnel system that truly helps teachers improve.

Following that, Matthew Di Carlo at the Shanker Institute took a deeper look at the data and at the error margins inherent in using it:

First, let’s quickly summarize the imprecision associated with the NYC value-added scores, using the raw datasets from the city. It has been heavily reported that the average confidence interval for these estimates – the range within which we can be confident the “true estimate” falls – is 35 percentile points in math and 53 in English Language Arts (ELA). But this oversimplifies the situation somewhat, as the overall average masks quite a bit of variation by data availability.
[...]
This can be illustrated by taking a look at the categories that the city (and the Journal) uses to label teachers (or, in the case of the Times, schools).

Here’s how teachers are rated: low (0-4th percentile); below average (5-24); average (25-74); above average (75-94); and high (95-99).

To understand the rocky relationship between value-added margins of error and these categories, first take a look at the Times’ “sample graph” below.
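
The same rocky relationship can be seen with simple arithmetic: take the category cutpoints listed above, center the citywide average confidence interval – 35 percentile points in math, 53 in ELA – on a point estimate, and count how many labels the interval touches. The sketch below does exactly that; the 80th-percentile example is invented for illustration, and only the interval widths come from the reported figures.

```python
# Rating categories as published: low (0-4), below average (5-24),
# average (25-74), above average (75-94), high (95-99).
CATEGORIES = [(0, 4, "low"), (5, 24, "below average"), (25, 74, "average"),
              (75, 94, "above average"), (95, 99, "high")]

def labels_spanned(point, ci_width):
    """All rating labels overlapped by the interval point +/- ci_width/2."""
    lo = max(0, point - ci_width / 2)
    hi = min(99, point + ci_width / 2)
    return [label for c_lo, c_hi, label in CATEGORIES if c_hi >= lo and c_lo <= hi]

# A hypothetical teacher estimated at the 80th percentile, using the citywide
# average interval widths reported for math (35 points) and ELA (53 points).
print(labels_spanned(80, 35))  # ['average', 'above average', 'high']
print(labels_spanned(80, 53))  # same three labels, over an even wider stretch
```

Whether an interval straddles a boundary depends on where the point estimate happens to sit, but with widths like these, being statistically consistent with two or three of the five labels is not unusual – which is the rocky relationship in question.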

That level of error in each measurement renders the teacher grades virtually useless. But that was just the start of the problems, as David Cohen notes in a piece titled "Big Apple’s Rotten Ratings".

So far, I think the best image from the whole fiasco comes from math teacher Gary Rubinstein, who ran the numbers himself, a bunch of different ways. The first analysis works on the premise that a teacher should not become dramatically better or worse in one year. He compared the data for 13,000 teachers over two consecutive years and found this – a virtually random distribution:
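
Rubinstein's chart plots each teacher's percentile in one year against the same teacher's percentile the following year. A rough sketch of how one might run that check on the public releases is below; the file names and column names are placeholders, not the actual fields in the NYC data.

```python
import pandas as pd

# Placeholder file and column names -- the real NYC releases are labeled differently.
y1 = pd.read_csv("teacher_data_reports_year1.csv")
y2 = pd.read_csv("teacher_data_reports_year2.csv")

merged = y1.merge(y2, on="teacher_id", suffixes=("_y1", "_y2"))

# If the scores mostly captured stable teacher quality, this correlation would
# be high; a near-random cloud like Rubinstein's implies it is low.
r = merged["percentile_y1"].corr(merged["percentile_y2"])
print(f"teachers matched across years: {len(merged)}")
print(f"year-to-year correlation of percentiles: {r:.2f}")

# Cross-tabulate the city's rating labels across the two years to see how
# often teachers jump categories from one year to the next.
bins = [0, 5, 25, 75, 95, 100]
labels = ["low", "below average", "average", "above average", "high"]
merged["label_y1"] = pd.cut(merged["percentile_y1"], bins, labels=labels, right=False)
merged["label_y2"] = pd.cut(merged["percentile_y2"], bins, labels=labels, right=False)
print(pd.crosstab(merged["label_y1"], merged["label_y2"]))
```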

First of all, as I’ve repeated every chance I get, the three leading professional organizations for educational research and measurement (AERA, NCME, APA) agree that you cannot draw valid inferences about teaching from a test that was designed and validated to measure learning; they are not the same thing. No one using value-added measurement EVER has an answer for that.

Then, I thought of a set of objections that had already been articulated on DiCarlo’s blog by a commenter. Harris Zwerling called for answers to the following questions if we’re to believe in value-added ratings:

1. Does the VAM used to calculate the results plausibly meet its required assumptions? Did the contractor test this? (See Harris, Sass, and Semykina, “Value-Added Models and the Measurement of Teacher Productivity” Calder Working Paper No. 54.)
2. Was the VAM properly specified? (e.g., Did the VAM control for summer learning, tutoring, test for various interactions, e.g., between class size and behavioral disabilities?)
3. What specification tests were performed? How did they affect the categorization of teachers as effective or ineffective?
4. How was missing data handled?
5. How did the contractors handle team teaching or other forms of joint teaching for the purposes of attributing the test score results?
6. Did they use appropriate statistical methods to analyze the test scores? (For example, did the VAM provider use regression techniques if the math and reading tests were not plausibly scored at an interval level?)
7. When referring back to the original tests, particularly ELA, does the range of teacher effects detected cover an educationally meaningful range of test performance?
8. To what degree would the test results differ if different outcome tests were used?
9. Did the VAM provider test for sorting bias?

Today, education historian Diane Ravitch published a piece titled "How to Demoralize Teachers", which draws all these problems together to highlight how counterproductive the effort is becoming:

Gates raises an important question: What is the point of evaluations? Shaming employees or helping them improve? In New York City, as in Los Angeles in 2010, it's hard to imagine that the publication of the ratings—with all their inaccuracies and errors—will result in anything other than embarrassing and humiliating teachers. No one will be a better teacher because of these actions. Some will leave this disrespected profession—which is daily losing the trappings of professionalism, the autonomy requisite to be considered a profession. Some will think twice about becoming a teacher. And children will lose the good teachers, the confident teachers, the energetic and creative teachers, they need.
[...]
Interesting that teaching is the only profession where job ratings, no matter how inaccurate, are published in the news media. Will we soon see similar evaluations of police officers and firefighters, legislators and reporters? Interesting, too, that no other nation does this to its teachers. Of course, when teachers are graded on a curve, 50 percent will be in the bottom half, and 25 percent in the bottom quartile.

Is this just another ploy to undermine public confidence in public education?

It's hard to avoid the conclusion that, for some, that might very well be the goal.

Teacher Attitudes about Compensation Reform

We want to bring to your attention three papers from The National Center for Analysis of Longitudinal Data in Education Research (CALDER). Some of them are pretty dense reading and probably aren't for everyone on a sunny summer's day. However, we are heading into a period in which many of these issues are front and center for the teaching profession, so it's worth a few minutes to simply read the conclusions if an entire paper is a little too much.

As Ohio moves toward high-stakes teacher evaluations based on student test scores – and, of course, merit pay tied to those evaluations – it will become increasingly important for educators to understand these issues. Knowing the strengths and weaknesses of these measures, and the current state of the research, will be crucial, because plenty of corporate education reformers care less about whether new approaches actually work than about profit-seeking or ideologically driven agendas.

The first paper looks at "Value-Added Models and the Measurement of Teacher Productivity", and unsurprisingly finds that while VAM has some interesting uses, the data and measurement techniques are not yet mature enough to be relied upon for high-stakes decision-making.


The second paper looks at "Teacher Attitudes About Compensation Reform", and finds that:

We conclude with a reminder that our analysis says nothing of the politics of adoption. Whether a district is able to successfully adopt compensation reform clearly depends on its relationship with its teachers union, not just the attitudes of individual teachers. And while the WSTCS presents these various incentive plans as if they are separate from each other, if compensation reform is to have the types of effects that advocates and reformers hope for, various combinations of incentives may need to be considered: not just merit pay alone but merit‐pay combined with subject‐area pay and/or combat pay and/or NBPTS incentives. Teacher opinions about such combinations are an important topic for future research.

The final paper we want to bring to your attention covers "Stepping Stones: Principal Career Paths and School Outcomes", included simply to highlight that school and student performance is affected by many complex variables, including school leadership itself.

We hope you continue to find the research we bring to your attention useful and informative, and if you are aware of any research we haven't uncovered, please let us know.