Shame, errors and demoralizing

Shame, errors and demoralizing: these are just some of the words being used since the NYT and other publications went ahead and published teacher-level value-added scores. A great number of articles have been written decrying the move.

Perhaps the most surprising critic of all was Bill Gates, in a piece titled "Shame Is Not the Solution". In it, Gates argues:

Value-added ratings are one important piece of a complete personnel system. But student test scores alone aren’t a sensitive enough measure to gauge effective teaching, nor are they diagnostic enough to identify areas of improvement. Teaching is multifaceted, complex work. A reliable evaluation system must incorporate other measures of effectiveness, like students’ feedback about their teachers and classroom observations by highly trained peer evaluators and principals.

Putting sophisticated personnel systems in place is going to take a serious commitment. Those who believe we can do it on the cheap — by doing things like making individual teachers’ performance reports public — are underestimating the level of resources needed to spur real improvement.
[...]
Developing a systematic way to help teachers get better is the most powerful idea in education today. The surest way to weaken it is to twist it into a capricious exercise in public shaming. Let’s focus on creating a personnel system that truly helps teachers improve.

Following that, Matthew Di Carlo at the Shanker Institute took a deeper look at the data and the error margins inherent in using it:

First, let’s quickly summarize the imprecision associated with the NYC value-added scores, using the raw datasets from the city. It has been heavily reported that the average confidence interval for these estimates – the range within which we can be confident the “true estimate” falls – is 35 percentile points in math and 53 in English Language Arts (ELA). But this oversimplifies the situation somewhat, as the overall average masks quite a bit of variation by data availability.
[...]
This can be illustrated by taking a look at the categories that the city (and the Journal) uses to label teachers (or, in the case of the Times, schools).

Here’s how teachers are rated: low (0-4th percentile); below average (5-24); average (25-74); above average (75-94); and high (95-99).

To understand the rocky relationship between value-added margins of error and these categories, first take a look at the Times’ “sample graph” below.
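To make the interaction between those interval widths and the rating categories concrete, here is a minimal Python sketch. The band boundaries and the 35- and 53-point average widths come from the passage above; the function name and the assumption that an interval is symmetric around the point estimate and clipped to the 0-99 scale are mine, purely for illustration, and are not how the city computed its intervals.

```python
# Rating bands as quoted above: (label, lowest percentile, highest percentile).
CATEGORIES = [
    ("low", 0, 4),
    ("below average", 5, 24),
    ("average", 25, 74),
    ("above average", 75, 94),
    ("high", 95, 99),
]


def categories_spanned(point_estimate, interval_width):
    """Return the label of every rating band the interval overlaps.

    Illustrative assumption: the interval is centered on the point
    estimate and clipped to the 0-99 percentile scale. Real confidence
    intervals need not be symmetric.
    """
    half = interval_width / 2
    low = max(0, point_estimate - half)
    high = min(99, point_estimate + half)
    # Treat each band as covering [start, end + 1) so that no gap is
    # left between adjacent integer-labelled bands.
    return [
        label
        for label, start, end in CATEGORIES
        if low < end + 1 and high >= start
    ]


if __name__ == "__main__":
    for width in (35, 53):  # average widths quoted for math and ELA
        print(width, categories_spanned(50, width))
```

Under these assumptions, a teacher rated at the 50th percentile with the average ELA interval of 53 points could plausibly belong to any of three categories, from "below average" to "above average".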

That level of error in each measurement renders the teacher grades virtually useless. But that was just the start of the problems, as David Cohen notes in a piece titled "Big Apple’s Rotten Ratings".

So far, I think the best image from the whole fiasco comes from math teacher Gary Rubinstein, who ran the numbers himself, a bunch of different ways. The first analysis works on the premise that a teacher should not become dramatically better or worse in one year. He compared the data for 13,000 teachers over two consecutive years and found this – a virtually random distribution:

First of all, as I’ve repeated every chance I get, the three leading professional organizations for educational research and measurement (AERA, NCME, APA) agree that you cannot draw valid inferences about teaching from a test that was designed and validated to measure learning; they are not the same thing. No one using value-added measurement EVER has an answer for that.

Then, I thought of a set of objections that had already been articulated on DiCarlo’s blog by a commenter. Harris Zwerling called for answers to the following questions if we’re to believe in value-added ratings:

1. Does the VAM used to calculate the results plausibly meet its required assumptions? Did the contractor test this? (See Harris, Sass, and Semykina, “Value-Added Models and the Measurement of Teacher Productivity” Calder Working Paper No. 54.)
2. Was the VAM properly specified? (e.g., Did the VAM control for summer learning, tutoring, test for various interactions, e.g., between class size and behavioral disabilities?)
3. What specification tests were performed? How did they affect the categorization of teachers as effective or ineffective?
4. How was missing data handled?
5. How did the contractors handle team teaching or other forms of joint teaching for the purposes of attributing the test score results?
6. Did they use appropriate statistical methods to analyze the test scores? (For example, did the VAM provider use regression techniques if the math and reading tests were not plausibly scored at an interval level?)
7. When referring back to the original tests, particularly ELA, does the range of teacher effects detected cover an educationally meaningful range of test performance?
8. To what degree would the test results differ if different outcome tests were used?
9. Did the VAM provider test for sorting bias?

Today, education historian Diane Ravitch published a piece titled "How to Demoralize Teachers", which draws all these problems together to highlight how counterproductive the effort is becoming:

Gates raises an important question: What is the point of evaluations? Shaming employees or helping them improve? In New York City, as in Los Angeles in 2010, it's hard to imagine that the publication of the ratings—with all their inaccuracies and errors—will result in anything other than embarrassing and humiliating teachers. No one will be a better teacher because of these actions. Some will leave this disrespected profession—which is daily losing the trappings of professionalism, the autonomy requisite to be considered a profession. Some will think twice about becoming a teacher. And children will lose the good teachers, the confident teachers, the energetic and creative teachers, they need.
[...]
Interesting that teaching is the only profession where job ratings, no matter how inaccurate, are published in the news media. Will we soon see similar evaluations of police officers and firefighters, legislators and reporters? Interesting, too, that no other nation does this to its teachers. Of course, when teachers are graded on a curve, 50 percent will be in the bottom half, and 25 percent in the bottom quartile.

Is this just another ploy to undermine public confidence in public education?

It's hard not to conclude that, for some, that might very well be the goal.