
The Toxic Trifecta in Current Legislative Models for Teacher Evaluation

A relatively consistent legislative framework for teacher evaluation has evolved across states in the past few years. Many of the legal concerns that arise do so because of inflexible, arbitrary, and often ill-conceived yet standard components of this legislative template. The standard model has three basic features, each of which is problematic in its own right, and those problems are compounded when the features are used in combination.

First, the standard evaluation model proposed in legislation requires that objective measures of student achievement growth be considered in a weighting system of parallel components. Student achievement growth measures are assigned, for example, a 40 or 50% weight alongside observation and other evaluation measures. Placing the measures alongside one another in a weighting scheme assumes that all measures in the scheme are of equal validity and reliability but of varied importance (utility) – hence the varied weights. Each measure must be included and must be assigned the prescribed weight, with no opportunity to question the validity of any measure.[1] Such a system also assumes that the various measures included in the system are each scaled such that they can vary to similar degrees. That is, it assumes that the observational evaluations are scaled to produce variation similar to that of the student growth measures, and that the variance in both measures is equally valid – not compromised by random error or bias. In fact, however, it remains highly likely that some components of the teacher evaluation model will vary far more than others, if for no other reason than that some measures contain more random noise than others or that some of the variation is attributable to factors beyond the teachers’ control. Regardless of the assigned weights, and regardless of the cause of the variation (true or false measure), the measure that varies more will carry more weight in the final classification of the teacher as effective or not. In a system that places differential weight on measures but assumes equal validity across them, even if the student achievement growth component carries only a minority share of the weight, it may easily become the primary tipping point in most high-stakes personnel decisions.
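This dynamic is easy to demonstrate with a short simulation. The sketch below (Python, with entirely made-up spreads, weights, and cutoffs chosen only for illustration) gives an observation score 60 percent of the nominal weight and a growth score 40 percent, but lets the growth score vary four times as much; the composite, and therefore any cutoff applied to it, ends up tracking the growth score almost entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # hypothetical number of teachers

# Two standardized component scores, uncorrelated for simplicity.
# The observation scores are tightly clustered; the growth scores
# vary widely (e.g., because they carry more random noise).
observation = rng.normal(0, 0.5, n)   # low spread
growth = rng.normal(0, 2.0, n)        # high spread

# Nominal weights: observation 60%, student growth 40%.
composite = 0.6 * observation + 0.4 * growth

# Which component actually drives the composite (and hence who falls
# below any cutoff applied to it)?
print("corr(composite, observation):", round(np.corrcoef(composite, observation)[0, 1], 2))
print("corr(composite, growth):     ", round(np.corrcoef(composite, growth)[0, 1], 2))
# Despite its smaller nominal weight, the higher-variance growth score
# correlates far more strongly with the composite than the observation
# score does.
```

Under these made-up numbers, the composite correlates at roughly 0.9 with the growth score but only around 0.35 with the observation score, despite the 60/40 nominal weights.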

Second, the standard evaluation model proposed in legislation requires that teachers be placed into effectiveness categories by applying arbitrary numerical cutoffs to the aggregated, weighted evaluation components. That is, a teacher at or below the 25%ile when combining all evaluation components might be assigned a rating of “ineffective,” whereas the teacher at the 26%ile might be labeled effective. Further, the teacher’s placement into these groupings may largely if not entirely hinge on their rating on the student achievement growth component of their evaluation. Teachers on either side of the arbitrary cutoff are undoubtedly statistically no different from one another. In many cases, as with the recently released effectiveness estimates for New York City teachers, the error ranges for the teacher percentile ranks have been on the order of 35 percentile points on average (and up to 50 points with only one year of data). Assuming that there is any real difference between the teacher at the 25%ile and the one at the 26%ile (their point estimates) is a huge, unwarranted stretch. Placing an arbitrary, rigid cut-off score onto such noisy measures makes distinctions that simply cannot be justified, especially when making high-stakes employment decisions.
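A minimal simulation illustrates how little a rigid cutoff can mean when error margins are that wide. In the sketch below (Python), the standard error of roughly nine percentile points is an assumption, chosen only to be loosely consistent with the error ranges described above, and the 25th-percentile cutoff mirrors the example in the previous paragraph.

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims = 100_000

# Illustrative assumption: an error range of roughly 35 percentile points,
# read as a +/- 2 standard error band (about +/- 17.5 points), implies a
# standard error of roughly 9 percentile points.
se = 9.0
cutoff = 25.0  # rated "ineffective" below the 25th percentile

for true_rank in (20, 26, 35):
    estimated = true_rank + rng.normal(0, se, n_sims)
    flagged = np.mean(estimated < cutoff)
    print(f"true rank {true_rank}: rated 'ineffective' {flagged:.0%} of the time")
# A teacher whose true rank is at the 26th percentile is flagged nearly
# half the time; one at the 35th percentile is still flagged about one
# time in eight.
```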

Third, the standard evaluation model proposed in legislation places exact timelines on the conditions for removal of tenure. Typical legislation dictates that teacher tenure either can or must be revoked and the teacher dismissed after 2 consecutive years of being rated ineffective (where tenure can only be achieved after 3 consecutive years of being rated effective).[2] As such, whether a teacher rightly or wrongly falls just below or just above the arbitrary cut-offs that define performance categories may have relatively inflexible consequences.
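The interaction between noisy ratings and rigid timelines can be seen with a back-of-the-envelope calculation. The sketch below assumes, generously, that classification errors are independent from year to year; the per-year error rates are hypothetical.

```python
# If a teacher whose true performance sits just above the cutoff is
# misclassified as "ineffective" in any single year with probability p,
# and those errors are independent across years, the chance of two
# consecutive "ineffective" ratings is p squared.
for p in (0.20, 0.35, 0.45):
    print(f"per-year error rate {p:.0%} -> two consecutive errors {p**2:.1%}")
# If classification errors persist across years (e.g., because the same
# students or school-level factors recur), the true risk would be higher
# than this independence assumption suggests.
```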

The Forced Choice between “Bad” Measures and “Wrong” Ones

[readon2 url="http://nepc.colorado.edu/blog/toxic-trifecta-bad-measurement-evolving-teacher-evaluation-policies"]Continue reading...[/readon2]

Poor schools can’t win

Without question, designing school and district rating systems is a difficult task, and Ohio was somewhat ahead of the curve in attempting to do so (and they’re also great about releasing a ton of data every year). As part of its application for ESEA waivers, the state recently announced a newly-designed version of its long-standing system, with the changes slated to go into effect in 2014-15. State officials told reporters that the new scheme is a “more accurate reflection of … true [school and district] quality.”

In reality, however, despite its best intentions, what Ohio has done is perpetuate a troubled system by making less-than-substantive changes that seem to serve the primary purpose of giving lower grades to more schools in order for the results to square with preconceptions about the distribution of “true quality.” It’s not a better system in terms of measurement – both the new and old schemes consist of mostly the same inappropriate components, and the ratings differentiate schools based largely on student characteristics rather than school performance.

So, whether or not the aggregate results seem more plausible is not particularly important, since the manner in which they’re calculated is still deeply flawed. And demonstrating this is very easy.

Rather than get bogged down in details about the schemes, the short and dirty version of the story is that the old system assigned six possible ratings based mostly on four measures: AYP; the state’s performance index; the percent of state standards met; and a value-added growth model (see our post for more details on the old system). The new system essentially retains most of the components of the old, but the formula is a bit different and it incorporates a new “achievement and graduation gap” measure that is supposed to gauge whether student subgroups are making acceptable progress. The “gap” measure is really the only major substantive change to the system’s components, but it basically just replaces one primitive measure (AYP) with another.*

Although the two systems yield different results overall, the major components of both – all but the value-added scores – are, directly or indirectly, “absolute performance” measures. They reflect how highly students score, not how quickly they improve. As a result, the measures are telling you more about the students that schools serve than the quality of instruction that they provide. Making high-stakes decisions based on this information is bad policy. For example, closing a school in a low-income neighborhood based on biased ratings not only means that one might very well be shutting down an effective school, but also that it’s unlikely it will be replaced by a more effective alternative.
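A toy simulation makes the composition problem concrete. In the sketch below (Python), the data-generating process and all of its coefficients are invented for illustration: average scores are assumed to depend mostly on poverty and only modestly on school effectiveness, while growth depends mainly on effectiveness. Under those assumptions, an “absolute performance” rating ends up functioning largely as a poverty proxy.

```python
import numpy as np

rng = np.random.default_rng(2)
n_schools = 3000

# Hypothetical data-generating process: a school's average score depends
# heavily on student poverty and only modestly on school effectiveness,
# while growth depends mainly on effectiveness.
poverty = rng.uniform(0, 1, n_schools)      # share of low-income students
quality = rng.normal(0, 1, n_schools)       # "true" school effectiveness
avg_score = -2.0 * poverty + 0.5 * quality + rng.normal(0, 0.3, n_schools)
growth = 0.8 * quality + rng.normal(0, 0.5, n_schools)

# An "absolute performance" rating is built from avg_score; a growth
# rating is built from growth.
print("corr(absolute rating, poverty):", round(np.corrcoef(avg_score, poverty)[0, 1], 2))
print("corr(growth rating, poverty):  ", round(np.corrcoef(growth, poverty)[0, 1], 2))
print("corr(growth rating, quality):  ", round(np.corrcoef(growth, quality)[0, 1], 2))
# Under these assumptions, the absolute measure largely ranks schools by
# poverty, while the growth measure tracks effectiveness.
```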

Put differently, the most important step in measuring schools’ effectiveness is controlling for confounding observable factors, most notably student characteristics. Ohio’s ratings are instead driven by them. And Ohio is not the only state for which that is true.

(Important side note: With the exception of the state’s value-added model, which, despite the usual issues, such as instability, is pretty good, virtually every indicator used by the state is a cutpoint-based measure. These are severely limited and potentially very misleading in ways that are unrelated to the bias. I will not be discussing these issues in this post, but see the second footnote below this post, and here and here for some related work.)**

The components of the new system

The severe bias in the new system’s constituent measures is unmistakable and easy to spot. To illustrate it in an accessible manner, I’ve identified the schools with free/reduced lunch rates that are among the highest 20 percent (highest quintile) of all non-charter schools in the state. This is an imperfect proxy for student background, but it’s sufficient for our purposes. (Note: charter schools are excluded from all these figures.)

The graph below breaks down schools in terms of how they scored (A-F) on each of the four components in the new system; these four grades are averaged to create the final grade. The bars represent the percent of schools (over 3,000 in total) receiving each grade that are in the highest poverty quintile. For example, looking at the last set of bars on the right (value-added), 17 percent of the schools that received the equivalent of an F (red bar) on the value-added component were high-poverty schools.
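For readers who want to reproduce this kind of breakdown from the state’s data files, the tabulation behind each set of bars is straightforward. The sketch below (Python/pandas) shows the general approach; the file name and column names are placeholders rather than the state’s actual field names.

```python
import pandas as pd

# Minimal sketch of the tabulation behind one set of bars, assuming a
# DataFrame with one row per school and hypothetical column names:
#   is_charter      - True for charter schools (excluded here)
#   frl_rate        - free/reduced-price lunch rate
#   component_grade - the school's A-F grade on one of the four components
schools = pd.read_csv("ohio_schools.csv")   # hypothetical file name
schools = schools[~schools["is_charter"]]

# Flag schools in the highest free/reduced lunch quintile (top 20 percent).
threshold = schools["frl_rate"].quantile(0.80)
schools["high_poverty"] = schools["frl_rate"] >= threshold

# Percent of schools receiving each grade that are high-poverty, i.e.,
# the height of each bar for this component.
bar_heights = (
    schools.groupby("component_grade")["high_poverty"].mean().mul(100).round(1)
)
print(bar_heights)
```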

[readon2 url="http://shankerblog.org/?p=5511"]Continue reading[/readon2]