
Lawsuit filed over unfair teacher evaluations

The Washington Post is reporting on a lawsuit filed by Florida teachers that could shake the foundations of teacher evaluation systems, not just in Florida but across the country, including here in Ohio.

A group of teachers and their unions filed a lawsuit on Tuesday against Florida officials that challenges the state’s educator evaluation system, under which many teachers are evaluated on the standardized test scores of students they do not teach.

The seven teachers who filed the lawsuit include Kim Cook, who, as this post explains, was evaluated at Irby Elementary, a K-2 school where she works and was named Teacher of the Year last December. But 40 percent of that evaluation was based on the test scores of students at Alachua Elementary, the school into which Irby feeds, students she never taught.

Kim Cook's story is very unnerving.

Here’s the crazy story of Kim Cook, a teacher at Irby Elementary, a K-2 school which feeds into Alachua Elementary, for grades 3-5, just down the road in Alachua, Fla. She was recently chosen by the teachers at her school as their Teacher of the Year.

Her plight dates back to last spring, when the Florida Legislature passed Senate Bill 736, which mandates that 40 percent of a teacher’s evaluation must be based on student scores on the state’s standardized tests, a method known as the value-added model, or VAM. It is essentially a formula that supposedly tells how much “value” a teacher has added to a student’s test score. Assessment experts say it is a terrible way to evaluate teachers but it has still been adopted by many states with the support of the Obama administration.

Since Cook’s school only goes through second grade, her school district is using the FCAT scores from the third graders at Alachua Elementary School to determine the VAM score for every teacher at her school.

Alachua Elementary School did not do well in 2011-12 evaluations that just came out; it received a D. Under the VAM model, the state awarded that school — and Cook’s school, by default — 10 points out of 100 for their D.

In this school district, there are three components to teacher evaluations:
1. A lesson study worth 20 percent. In the lesson study, small groups of teachers work together to create an exemplary lesson, observe one of the teachers implement it, critique the teacher’s performance and discuss improvement.
2. Principal appraisal worth 40 percent of overall score.
3. VAM data (scores from the standardized Florida Comprehensive Assessment Test scores for elementary schools) worth 40 percent of the overall score.

Cook received full points on her lesson study: 100 x .20 (20%) = 20 points
Cook received an 88/100 from her former principal: 88/100 x .40 (40%) = 35.2 points
On VAM data — points awarded by the state for the FCAT scores at Alachua Elementary School: 10/100 x .40 (40%) = 4 points
Total points that she received: 59.2 (Unsatisfactory)
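The weighted arithmetic above can be reproduced in a few lines; this sketch uses only the component scores and weights quoted above:

```python
# Reproduce Kim Cook's evaluation score from the components above.
# District weights: lesson study 20%, principal appraisal 40%, VAM 40%.
components = {
    "lesson_study": (100, 0.20),  # full points on her lesson study
    "principal":    (88,  0.40),  # her former principal's appraisal
    "vam":          (10,  0.40),  # state-awarded points for Alachua's FCAT scores
}

total = sum(score * weight for score, weight in components.values())
print(round(total, 1))  # 59.2 -> rated "Unsatisfactory"
```

Note that the 4 points from a school she never taught at are what drag an otherwise strong evaluation below the cutoff.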

Here's a video of Kim speaking on this issue

We imagine this to be the first, but not the last, legal action against many of the provisions corporate education reformers are trying to cram into teacher evaluations.

Gates Foundation Wastes More Money Pushing VAM

Makes it hard to trust the corporate ed reformers when they goose their stats as badly as this.

Any attempt to evaluate teachers that is spoken of repeatedly as being "scientific" is naturally going to provoke rebuttals that verge on technical geek-speak. The MET Project's "Ensuring Fair and Reliable Measures of Effective Teaching" brief does just that. MET was funded by the Bill & Melinda Gates Foundation.

At the center of the brief's claims are a couple of figures (“scatter diagrams” in statistical lingo) that show remarkable agreement in VAM scores for teachers in Language Arts and Math for two consecutive years. The dots form virtual straight lines. A teacher with a high VAM score one year can be relied on to have an equally high VAM score the next, so Figure 2 seems to say.

Not so. The scatter diagrams are not dots of teachers' VAM scores but of averages of groups of VAM scores. For some unexplained reason, the statisticians who analyzed the data for the MET Project report divided the 3,000 teachers into 20 groups of about 150 teachers each and plotted the average VAM scores for each group. Why?

And whatever the reason might be, why would one do such a thing when it has been known for more than 60 years now that correlating averages of groups grossly overstates the strength of the relationship between two variables? W.S. Robinson in 1950 named this the "ecological correlation fallacy." Please look it up in Wikipedia. The fallacy was used decades ago to argue that African-Americans were illiterate because the correlation of %-African-American and %-illiterate was extremely high when measured at the level of the 50 states. In truth, at the level of persons, the correlation is very much lower; we’re talking about differences as great as .90 for aggregates vs .20 for persons.

Just because the average of VAM scores for 150 teachers will agree with next year's VAM score average for the same 150 teachers gives us no confidence that an individual teacher's VAM score is reliable across years. In fact, such scores are not — a fact shown repeatedly in several studies.
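Robinson's point is easy to demonstrate with simulated data. The sketch below invents 3,000 "teachers" whose year-to-year VAM scores correlate only weakly at the individual level, then averages them into 20 groups of 150, mirroring the MET grouping; all numbers are made up for illustration, not drawn from the MET data:

```python
# Demonstrate the ecological correlation fallacy: correlating group
# averages grossly overstates the individual-level relationship.
import random
import statistics

random.seed(0)

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Each "teacher" has a stable ability plus large year-to-year noise,
# so individual year-1/year-2 scores correlate only weakly.
n = 3000
ability = [random.gauss(0, 1) for _ in range(n)]
year1 = [a + random.gauss(0, 2) for a in ability]
year2 = [a + random.gauss(0, 2) for a in ability]

r_individual = corr(year1, year2)

# Sort by year-1 score, split into 20 groups of 150, and correlate the
# group means -- the noise averages out, inflating the correlation.
order = sorted(range(n), key=lambda i: year1[i])
groups = [order[i:i + 150] for i in range(0, n, 150)]
m1 = [statistics.mean(year1[i] for i in g) for g in groups]
m2 = [statistics.mean(year2[i] for i in g) for g in groups]
r_groups = corr(m1, m2)

print(f"individual r = {r_individual:.2f}, group-average r = {r_groups:.2f}")
```

Running this, the individual-level correlation comes out modest while the 20 group averages line up almost perfectly, which is exactly the near-straight-line picture the MET figures present.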

[readon2 url="http://ed2worlds.blogspot.com/2013/01/gates-foundation-wastes-more-money.html"]Continue reading...[/readon2]

Shame, errors and demoralizing

Shame, errors, and demoralization: just some of the rhetoric that has emerged since the NYT and other publications went ahead and published teacher-level value-added scores. A great number of articles have been written decrying the move.

Perhaps most surprising of all was Bill Gates, in a piece titled "Shame Is Not the Solution". In it, Gates argues

Value-added ratings are one important piece of a complete personnel system. But student test scores alone aren’t a sensitive enough measure to gauge effective teaching, nor are they diagnostic enough to identify areas of improvement. Teaching is multifaceted, complex work. A reliable evaluation system must incorporate other measures of effectiveness, like students’ feedback about their teachers and classroom observations by highly trained peer evaluators and principals.

Putting sophisticated personnel systems in place is going to take a serious commitment. Those who believe we can do it on the cheap — by doing things like making individual teachers’ performance reports public — are underestimating the level of resources needed to spur real improvement.
[...]
Developing a systematic way to help teachers get better is the most powerful idea in education today. The surest way to weaken it is to twist it into a capricious exercise in public shaming. Let’s focus on creating a personnel system that truly helps teachers improve.

Following that, Matthew Di Carlo at the Shanker Institute took a deeper look at the data and the error margins inherent in using it.

First, let’s quickly summarize the imprecision associated with the NYC value-added scores, using the raw datasets from the city. It has been heavily reported that the average confidence interval for these estimates – the range within which we can be confident the “true estimate” falls – is 35 percentile points in math and 53 in English Language Arts (ELA). But this oversimplifies the situation somewhat, as the overall average masks quite a bit of variation by data availability.
[...]
This can be illustrated by taking a look at the categories that the city (and the Journal) uses to label teachers (or, in the case of the Times, schools).

Here’s how teachers are rated: low (0-4th percentile); below average (5-24); average (25-74); above average (75-94); and high (95-99).
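To see why a confidence interval 35 points wide makes these labels shaky, consider a small sketch; the category cutoffs come from the list above, while the teacher's percentile is invented for illustration:

```python
# Map a percentile to the NYC rating categories listed above, then show
# how a 35-point-wide confidence interval straddles multiple categories.
def category(pct):
    if pct <= 4:
        return "low"
    if pct <= 24:
        return "below average"
    if pct <= 74:
        return "average"
    if pct <= 94:
        return "above average"
    return "high"

point_estimate = 60    # hypothetical math teacher at the 60th percentile
half_width = 35 / 2    # reported average CI in math is 35 points wide
low = max(0, point_estimate - half_width)
high = min(99, point_estimate + half_width)

print(category(point_estimate))            # average
print(category(low), "-", category(high))  # average - above average
```

The same teacher is "average" or "above average" depending on where the true score falls inside the interval; with the 53-point ELA interval the spread is even wider.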

To understand the rocky relationship between value-added margins of error and these categories, first take a look at the Times’ “sample graph” below.

That level of error in each measurement renders the teacher grades virtually useless. But that was just the start of the problems, as David Cohen notes in a piece titled "Big Apple’s Rotten Ratings".

So far, I think the best image from the whole fiasco comes from math teacher Gary Rubinstein, who ran the numbers himself, a bunch of different ways. The first analysis works on the premise that a teacher should not become dramatically better or worse in one year. He compared the data for 13,000 teachers over two consecutive years and found this – a virtually random distribution:

First of all, as I’ve repeated every chance I get, the three leading professional organizations for educational research and measurement (AERA, NCME, APA) agree that you cannot draw valid inferences about teaching from a test that was designed and validated to measure learning; they are not the same thing. No one using value-added measurement EVER has an answer for that.

Then, I thought of a set of objections that had already been articulated on DiCarlo’s blog by a commenter. Harris Zwerling called for answers to the following questions if we’re to believe in value-added ratings:

1. Does the VAM used to calculate the results plausibly meet its required assumptions? Did the contractor test this? (See Harris, Sass, and Semykina, “Value-Added Models and the Measurement of Teacher Productivity” Calder Working Paper No. 54.)
2. Was the VAM properly specified? (e.g., Did the VAM control for summer learning, tutoring, test for various interactions, e.g., between class size and behavioral disabilities?)
3. What specification tests were performed? How did they affect the categorization of teachers as effective or ineffective?
4. How was missing data handled?
5. How did the contractors handle team teaching or other forms of joint teaching for the purposes of attributing the test score results?
6. Did they use appropriate statistical methods to analyze the test scores? (For example, did the VAM provider use regression techniques if the math and reading tests were not plausibly scored at an interval level?)
7. When referring back to the original tests, particularly ELA, does the range of teacher effects detected cover an educationally meaningful range of test performance?
8. To what degree would the test results differ if different outcome tests were used?
9. Did the VAM provider test for sorting bias?

Today, education historian Diane Ravitch published a piece titled "How to Demoralize Teachers", which draws all these problems together to highlight how counterproductive the effort is becoming.

Gates raises an important question: What is the point of evaluations? Shaming employees or helping them improve? In New York City, as in Los Angeles in 2010, it's hard to imagine that the publication of the ratings—with all their inaccuracies and errors—will result in anything other than embarrassing and humiliating teachers. No one will be a better teacher because of these actions. Some will leave this disrespected profession—which is daily losing the trappings of professionalism, the autonomy requisite to be considered a profession. Some will think twice about becoming a teacher. And children will lose the good teachers, the confident teachers, the energetic and creative teachers, they need.
[...]
Interesting that teaching is the only profession where job ratings, no matter how inaccurate, are published in the news media. Will we soon see similar evaluations of police officers and firefighters, legislators and reporters? Interesting, too, that no other nation does this to its teachers. Of course, when teachers are graded on a curve, 50 percent will be in the bottom half, and 25 percent in the bottom quartile.

Is this just another ploy to undermine public confidence in public education?

It's hard to conclude that for some, that might very well be the goal.

Some Hows and Whys of Value Add Modelling

We thought it would be useful to provide a quick primer on what Value Add actually is and how it is calculated, in relatively plain terms. This is a good explanation via the American Statistical Association:

The principal claim made by the developers of VAM—William L. Sanders, Arnold M. Saxton, and Sandra P. Horn—is that through the analysis of changes in student test scores from one year to the next, they can objectively isolate the contributions of teachers and schools to student learning. If this claim proves to be true, VAM could become a powerful tool for both teachers’ professional development and teachers’ evaluation.

This approach represents an important divergence from the path specified by the “adequate yearly progress” provisions of the No Child Left Behind Act, for it focuses on the gain each student makes, rather than the proportion of students who attain some particular standard. VAM’s attention to individual student’s longitudinal data to measure their progress seems filled with commonsense and fairness. There are many models that fall under the general heading of VAM. One of the most widely used was developed and programmed by William Sanders and his colleagues. It was developed for use in Tennessee and has been in place there for more than a decade under the name Tennessee Value-Added Assessment System. It also has been called the “layered model” because of the way each of its annual component pieces is layered on top of another.

The model begins by representing a student’s test score in the first year, y1, as the sum of the district’s average for that grade, subject, and year, say μ1; the incremental contribution of the teacher, say θ1; and systematic and unsystematic errors, say ε1. When these pieces are put together, we obtain a simple equation for the first year:

y1 = μ1+ θ1+ ε1 (1)
or
Student’s score (1) = district average (1) + teacher effect (1) + error (1)

There are similar equations for the second, third, fourth, and fifth years, and it is instructive to look at the second year’s equation, which looks like the first except it contains a term for the teacher’s effect from the previous year:

y2 = μ2+ θ1+ θ2+ ε2 . (2)
or
Student’s score (2) = district average (2) + teacher effect (1) + teacher effect (2) + error (2)

To assess the value added (y2 – y1), we merely subtract equation (1) from equation (2) and note that the effect of the teacher from the first year has conveniently dropped out. While this is statistically convenient, because it leaves us with fewer parameters to estimate, does it make sense? Some have argued that although a teacher’s effect lingers beyond the year the student had her/him, that effect is likely to shrink with time.
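The cancellation can be checked numerically. In this sketch all values (district averages, teacher effects, errors) are invented for illustration:

```python
# Sketch of the layered model's two equations and the year-over-year gain.
# All values below are made up for illustration.
mu1, mu2 = 50.0, 55.0      # district averages, years 1 and 2
theta1, theta2 = 3.0, 2.0  # teacher effects, years 1 and 2
eps1, eps2 = 0.5, -0.4     # error terms

y1 = mu1 + theta1 + eps1           # equation (1)
y2 = mu2 + theta1 + theta2 + eps2  # equation (2): theta1 is "layered" in

gain = y2 - y1
# theta1 cancels in the subtraction, leaving only the change in the
# district average, the year-2 teacher effect, and the error difference.
assert abs(gain - ((mu2 - mu1) + theta2 + (eps2 - eps1))) < 1e-9
```

Whatever value theta1 takes, it appears in both equations and drops out of the gain, which is exactly the statistical convenience (and the modeling assumption) the passage questions.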

Although such a model is less convenient to estimate, it more realistically mirrors reality. But, not surprisingly, the estimate of the size of a teacher’s effect varies depending on the choice of model. How large this choice-of-model effect is, relative to the size of the “teacher effect” is yet to be determined. Obviously, if it is large, it diminishes the practicality of the methodology.

Recent research from the RAND Corporation, which shifts from the layered model to one that estimates how much a teacher’s effect changes from one year to the next, suggests that almost half of the teacher effect is accounted for by the choice of model.

One cannot partition student effect from teacher effect without information about how the same students perform with other teachers. In practice, using longitudinal data and obtaining measures of student performance in other years can resolve this issue. The decade of Tennessee’s experience with VAM led to a requirement of at least three years’ data. This requirement raises concerns when (i) data are missing and (ii) the meaning of what is being tested changes with time.

The Ohio Department of Education has papers, here, that discuss the technical details of how VAM is done in Ohio.

BattelleforKids.org provided us this information:

Here's a brief description of both analyses used in Ohio. Both are from the EVAAS methodology produced by SAS:

Value-added analysis is produced in two different ways in Ohio:
1. MRM analysis (Multivariate Response Model, also known as the mean gain approach); and
2. URM analysis (Univariate Response Model, also known as the predicted mean approach).

The MRM analysis is used for the Ohio value-added results in grades 4-8 math and reading. It can only be used when tests are uniformly administered in consecutive grades. Through this approach, district, school, and teacher level results are compared to a growth standard. The OAA assessments provide the primary data for this approach.

The URM analysis is used for expanded value-added results. Currently this analysis is provided through the Battelle for Kids' (BFK) SOAR and Ohio Value-Added High Schools (OVAHS) projects. The URM analysis is used when tests are not given in consecutive grades. This approach "pools" together districts that use the same sequence of particular norm-referenced tests. In the URM analysis, prior test data are used to produce a prediction of how a student is likely to score on a particular test, given the average experience in that school. For example, prior OAA and TerraNova results are used as predictors for the ACT end-of-course exams. Differences between students' predicted and actual/observed scores are used to produce school and teacher effects. The URM analysis is normalized each year based on the performance of other schools in the pool that year. This means a comparison is made to the growth of the average school or teacher for that grade/subject in the pool.
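The URM's predict-then-compare logic can be sketched with an ordinary least-squares fit of current scores on a prior test, treating the residual as the value-added estimate. The data below are invented, and the real EVAAS model is considerably more elaborate:

```python
# Toy sketch of the URM (predicted mean) approach: regress current scores
# on a prior test, then read the residual as the value-added signal.
# Scores are invented; EVAAS pools many tests and uses a richer model.
import statistics

prior  = [400, 420, 390, 450, 430, 410]  # prior-year test scores
actual = [415, 440, 385, 470, 445, 420]  # observed current scores

# Ordinary least-squares fit: predicted = a + b * prior
mp, ma = statistics.mean(prior), statistics.mean(actual)
b = sum((x - mp) * (y - ma) for x, y in zip(prior, actual)) / \
    sum((x - mp) ** 2 for x in prior)
a = ma - b * mp

predicted = [a + b * x for x in prior]
residuals = [y - p for y, p in zip(actual, predicted)]

# A positive mean residual for a teacher's students would be read as the
# teacher "adding value" relative to the pool's average experience.
print([round(r, 1) for r in residuals])
```

By construction the residuals average to zero across the pool, which is why the approach always grades teachers against the pool's average rather than against any absolute standard.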

Caution urged on high stakes use of value add

The American Mathematical Society just published a paper titled "Mathematical Intimidation: Driven by the Data" that discusses the issues with using Value Add in high stakes decision making, such as teacher evaluation. It's quite a short read, and well worth the effort.

Many studies by reputable scholarly groups call for caution in using VAMs for high-stakes decisions about teachers.

A RAND research report: The estimates from VAM modeling of achievement will often be too imprecise to support some of the desired inferences [McCaffrey 2004, 96].

A policy paper from the Educational Testing Service’s Policy Information Center: VAM results should not serve as the sole or principal basis for making consequential decisions about teachers. There are many pitfalls to making causal attributions of teacher effectiveness on the basis of the kinds of data available from typical school districts. We still lack sufficient understanding of how seriously the different technical problems threaten the validity of such interpretations [Braun 2005, 17].

A report from a workshop of the National Academy of Education: Value-added methods involve complex statistical models applied to test data of varying quality. Accordingly, there are many technical challenges to ascertaining the degree to which the output of these models provides the desired estimates [Braun 2010].
[...]
Making policy decisions on the basis of value-added models has the potential to do even more harm than browbeating teachers. If we decide whether alternative certification is better than regular certification, whether nationally board certified teachers are better than randomly selected ones, whether small schools are better than large, or whether a new curriculum is better than an old by using a flawed measure of success, we almost surely will end up making bad decisions that affect education for decades to come.

This is insidious because, while people debate the use of value-added scores to judge teachers, almost no one questions the use of test scores and value-added models to judge policy. Even people who point out the limitations of VAM appear to be willing to use “student achievement” in the form of value-added scores to make such judgments. People recognize that tests are an imperfect measure of educational success, but when sophisticated mathematics is applied, they believe the imperfections go away by some mathematical magic. But this is not magic. What really happens is that the mathematics is used to disguise the problems and intimidate people into ignoring them—a modern, mathematical version of the Emperor’s New Clothes.

The entire, short paper, can be read below.

Mathematical Intimidation: Driven by the Data