assigned

The Science of Value-Added Evaluation

"A value-added analysis constitutes a series of personal, high-stakes experiments conducted under extremely uncontrolled conditions".

If drug experiments were conduted like VAM we might all have 3 legs or worse

Value-added teacher evaluation has been extensively criticized and strongly defended, but less frequently examined from a dispassionate scientific perspective. Among the value-added movement's most fervent advocates is a respected scientific school of thought that believes reliable causal conclusions can be teased out of huge data sets by economists or statisticians using sophisticated statistical models that control for extraneous factors.

Another scientific school of thought, especially prevalent in medical research, holds that the most reliable method for arriving at defensible causal conclusions involves conducting randomized controlled trials, or RCTs, in which (a) individuals are premeasured on an outcome, (b) randomly assigned to receive different treatments, and (c) measured again to ascertain if changes in the outcome differed based upon the treatments received.

The purpose of this brief essay is not to argue the pros and cons of the two approaches, but to frame value-added teacher evaluation from the latter, experimental perspective. For conceptually, what else is an evaluation of perhaps 500 4th grade teachers in a moderate-size urban school district but 500 high-stakes individual experiments? Are not students premeasured, assigned to receive a particular intervention (the teacher), and measured again to see which teachers were the more (or less) efficacious?

Granted, a number of structural differences exist between a medical randomized controlled trial and a districtwide value-added teacher evaluation. Medical trials normally employ only one intervention instead of 500, but the basic logic is the same. Each medical RCT is also privy to its own comparison group, while individual teachers share a common one (consisting of the entire district's average 4th grade results).

From a methodological perspective, however, both medical and teacher-evaluation trials are designed to generate causal conclusions: namely, that the intervention was statistically superior to the comparison group, statistically inferior, or just the same. But a degree in statistics shouldn't be required to recognize that an individual medical experiment is designed to produce a more defensible causal conclusion than the collected assortment of 500 teacher-evaluation experiments.

How? Let us count the ways:

  • Random assignment is considered the gold standard in medical research because it helps to ensure that the participants in different experimental groups are initially equivalent and therefore have the same propensity to change relative to a specified variable. In controlled clinical trials, the process involves a rigidly prescribed computerized procedure whereby every participant is afforded an equal chance of receiving any given treatment. Public school students cannot be randomly assigned to teachers between schools for logistical reasons and are seldom if ever truly randomly assigned within schools because of (a) individual parent requests for a given teacher; (b) professional judgments regarding which teachers might benefit certain types of students; (c) grouping of classrooms by ability level; and (d) other, often unknown, possibly idiosyncratic reasons. Suffice it to say that no medical trial would ever be published in any reputable journal (or reputable newspaper) which assigned its patients in the haphazard manner in which students are assigned to teachers at the beginning of a school year.
  • Medical experiments are designed to purposefully minimize the occurrence of extraneous events that might potentially influence changes on the outcome variable. (In drug trials, for example, it is customary to ensure that only the experimental drug is received by the intervention group, only the placebo is received by the comparison group, and no auxiliary treatments are received by either.) However, no comparable procedural control is attempted in a value-added teacher-evaluation experiment (either for the current year or for prior student performance) so any student assigned to any teacher can receive auxiliary tutoring, be helped at home, team-taught, or subjected to any number of naturally occurring positive or disruptive learning experiences.
  • When medical trials are reported in the scientific literature, their statistical analysis involves only the patients assigned to an intervention and its comparison group (which could quite conceivably constitute a comparison between two groups of 30 individuals). This means that statistical significance is computed to facilitate a single causal conclusion based upon a total of 60 observations. The statistical analyses reported for a teacher evaluation, on the other hand, would be reported in terms of all 500 combined experiments, which in this example would constitute a total of 15,000 observations (or 30 students times 500 teachers). The 500 causal conclusions published in the newspaper (or on a school district website), on the other hand, are based upon separate contrasts of 500 "treatment groups" (each composed of changes in outcomes for a single teacher's 30 students) versus essentially the same "comparison group."
  • Explicit guidelines exist for the reporting of medical experiments, such as the (a) specification of how many observations were lost between the beginning and the end of the experiment (which is seldom done in value-added experiments, but would entail reporting student transfers, dropouts, missing test data, scoring errors, improperly marked test sheets, clerical errors resulting in incorrect class lists, and so forth for each teacher); and (b) whether statistical significance was obtained—which is impractical for each teacher in a value-added experiment since the reporting of so many individual results would violate multiple statistical principles.

[readon2 url="http://www.edweek.org/ew/articles/2013/01/16/17bausell.h32.html"]Continue reading...[/readon2]

The Toxic Trifecta in Current Legislative Models for Teacher Evaluation

A relatively consistent legislative framework for teacher evaluation has evolved across states in the past few years. Many of the legal concerns that arise do so because of inflexible, arbitrary and often ill-conceived yet standard components of this legislative template. There exist three basic features of the standard model, each of which is problematic on its own regard, and those problems become multiplied when used in combination.

First, the standard evaluation model proposed in legislation requires that objective measures of student achievement growth necessarily be considered in a weighting system of parallel components. Student achievement growth measures are assigned, for example, a 40 or 50% weight alongside observation and other evaluation measures. Placing the measures alongside one another in a weighting scheme assumes all measures in the scheme to be of equal validity and reliability but of varied importance (utility) – varied weight. Each measure must be included, and must be assigned the prescribed weight – with no opportunity to question the validity of any measure. [1]Such a system also assumes that the various measures included in the system are each scaled such that they can vary to similar degrees. That is, that the observational evaluations will be scaled to produce similar variation to the student growth measures, and that the variance in both measures is equally valid – not compromised by random error or bias. In fact, however, it remains highly likely that some components of the teacher evaluation model will vary far more than others if by no other reasons than that some measures contain more random noise than others or that some of the variation is attributable to factors beyond the teachers’ control. Regardless of the assigned weights and regardless of the cause of the variation (true or false measure) the measure that varies more will carry more weight in the final classification of the teacher as effective or not. In a system that places differential weight, but assumes equal validity across measures, even if the student achievement growth component is only a minority share of the weight, it may easily become the primary tipping point in most high stakes personnel decisions.

Second, the standard evaluation model proposed in legislation requires that teachers be placed into effectiveness categories by assigning arbitrary numerical cutoffs to the aggregated weighted evaluation components. That is, a teacher in the 25%ile or lower when combining all evaluation components might be assigned a rating of “ineffective,” whereas the teacher at the 26%ile might be labeled effective. Further, the teacher’s placement into these groupings may largely if not entirely hinge on their rating in the student achievement growth component of their evaluation. Teachers on either side of the arbitrary cutoff are undoubtedly statistically no different from one another. In many cases as with the recently released teacher effectiveness estimates on New York City teachers, the error ranges for the teacher percentile ranks have been on the order of 35%ile points (on average, up to 50% with one year of data). Assuming that there is any real difference between the teacher at the 25%ile and 26%ile (as their point estimate) is a huge unwarranted stretch. Placing an arbitrary, rigid, cut-off score into such noisy measures makes distinctions that simply cannot be justified especially when making high stakes employment decisions.

Third, the standard evaluation model proposed in legislation places exact timelines on the conditions for removal of tenure. Typical legislation dictates that teacher tenure either can or must be revoked and the teacher dismissed after 2 consecutive years of being rated ineffective (where tenure can only be achieved after 3 consecutive years of being rate effective).[2]As such, whether a teacher rightly or wrongly falls just below or just above the arbitrary cut-offs that define performance categories may have relatively inflexible consequences.

The Forced Choice between “Bad” Measures and “Wrong” Ones

[readon2 url="http://nepc.colorado.edu/blog/toxic-trifecta-bad-measurement-evolving-teacher-evaluation-policies"]Continue reading...[/readon2]

EdChoice Program Designated Public Schools list

The Ohio Department of Education has just released their EdChoice Program Designated Public Schools list.

The following is a list of schools that have either:
1) been in Academic Emergency or Academic Watch for two of the past three school years (2008-2009, 2009-2010, and 2010-2011) or
2) ranked in the lowest 10% of public school buildings by performance index score for two of the past three school years and are therefore designated for the EdChoice Scholarship Program.

The following students are eligible to apply for an EdChoice scholarship:

  • Students currently enrolled in and attending EdChoice-designated public school buildings in their district of residence
  • Students enrolled in a community school who would otherwise be assigned to one of the EdChoice-designated public school buildings, or;
  • Students currently enrolled in a regular public school in their district of residence or community school who would be assigned to attend one of the EdChoice-designated public school buildings for the upcoming school year. This provision is for students moving from one level of school to the next. For example, public school students moving from elementary to middle school would be eligible to apply if the middle school that they would be assigned to in the fall is designated for EdChoice;
  • Students eligible to enter kindergarten in the next school year who would be assigned to one of the EdChoice public school buildings.

Ed Choice Eligible Public Schools