measures

The Foolish Endeavor of Rating Ed Schools by Graduates’ Value-Added

April 24, 2013 in Article

Knowing that I’ve been writing a fair amount about various methods for attributing student achievement to their teachers, several colleagues forwarded to me the recently released standards of the Council For the Accreditation of Educator Preparation, or CAEP. Specifically, several colleagues pointed me toward Standard 4.1 Impact on Student Learning:

4.1.The provider documents, using value-added measures where available, other state-supported P-12 impact measures, and any other measures constructed by the provider, that program completers contribute to an expected level of P-12 student growth.

http://caepnet.org/commission/standards/standard4/

Now, it’s one thing when relatively under-informed pundits, think tankers, politicians and their policy advisors pitch a misguided use of statistical information for immediate policy adoption. It’s yet another when professional organizations are complicit in this misguided use. There’s just no excuse for that! (political pressure, public polling data, or otherwise)

The problems associated with attempting to derive any reasonable conclusions about teacher preparation program quality based on value-added or student growth data (of the students they teach in their first assignments) are insurmountable from a research perspective.

Worse, the perverse incentives likely induced by such a policy are far more likely to do real harm than any good, when it comes to the distribution of teacher and teaching quality across school settings within states.

First and foremost, the idea that we can draw this simple line below between preparation and practice contradicts nearly every reality of modern day teacher credentialing and progress into and through the profession:

one teacher prep institution –> one teacher –> one job in one school –> one representative group of students

The modern day teacher collects multiple credentials from multiple institutions, may switch jobs a handful of times early in his/her career and may serve a very specific type of student, unlike those taught by either peers from the same credentialing program or those from other credentialing programs. This model also relies heavily on minimal to no migration of teachers across state borders (well, either little or none, or a ton of it, so that a state would have a large enough share of teachers from specific out of state institutions to compare). I discuss these issues in earlier posts.

Setting aside that none of the oversimplified assumptions of the linear diagram above hold (a lot to ignore!), let’s probe the more geeky technical issues of trying to use VAM to evaluate ed school effectiveness.

There exist a handful of recent studies which attempt to tease out certification program effects on graduate’s student’s outcomes, most of which encounter the same problems. Here’s a look at one of the better studies on this topic.

Mihaly, K., McCaffrey, D. F., Sass, T. R., & Lockwood, J. R. (2012). Where You Come From or Where You Go?

Specifically, this study tries to tease out the problem that arises when graduates of credentialing programs don’t sort evenly across a state. In other words, a problem that ALWAYS occurs in reality!

Researchy language tends to downplay these problems by phrasing them only in technical terms and always assuming there is some way to overcome them with statistical tweak or two. Sometimes there just isn’t and this is one of those times!

[readon2 url="http://schoolfinance101.wordpress.com/2013/02/25/revisiting-the-foolish-endeavor-of-rating-ed-schools-by-graduates-value-added/"]Continue reading...[/readon2]

ODE publishes propaganda

January 30, 2013 in Article

prop·a·gan·da
/ˌpräpəˈgandə/
Noun
1. Information, esp. of a biased or misleading nature, used to promote or publicize a particular political cause or point of view.
2. The dissemination of such information as a political strategy.

That aptly describes the latest document published by the Ohio Department of Education, titled "Myths vs. Facts about the Ohio Teacher Evaluation System". The document lists 10 alleged myths about the teacher evaluation system being created. We thought we'd take a closer look at some of these alleged "myths".

1. Myth: The state is telling us what to do in local evaluations.

ODE, under a bulleted list discussing local board flexibility in creating evaluations, state "The percentages within the given range for student growth measures for the teachers in that district;" This is no longer true for teacher who have Value-add scores. These teachers (over 30% of Ohio's teaching corps) will have 50% of their evaluation based on student test scores. On this, local boards have zero flexibility, it's a state mandate. We judge aspects of this myth to actually be true

2. Myth: This is just a way to fire teachers.

ODE goes to great length to discuss how these evaluations will be great for teachers in identifying areas of improvement (though no money has been allocated for professional development). Utterly lacking is any discussion of the provision within HB153 prohibits giving preference based on seniority in determining the order of layoffs or in rehiring teachers when positions become available again, except when choosing between teachers with comparable evaluations. It is no secret that corporate education reformers such as Michelle Rhee desperately want to use evaluations for the basis of firing what they purportedly measure to be "ineffective" teachers. After all, this is exactly the process used in Washington DC where she came from. It's far too soon to call this a myth, it's more like a corporate educators goal.

3. Myth: One test in the spring will determine my fate.

It's nice that ODE stresses the importance of using multiple measures, but once again they fail to acknowledge that HB555 removed those multiple measures for 30% of Ohio's teachers. For those teachers their fate will be determined by tests. This myth is therefore true.

5. Myth: The state has not done enough work on this system – there are too many unanswered questions.

How can it be a myth when even this documents fails to state that "we're ready". SLO's have yet to be developed, Common Core is almost upon us but no one knows what the tests will be, the legislature keeps changing the rules of the game and no where near enough evaluator training has taken place to evaluate all of Ohio's teachers. Ohio isn't ready for this and that's a fact, not a myth.

6. Myth: “Value-Added” is a mysterious formula and is too volatile to be trusted.

This is perhaps one of the most egregious points of all. Study after study after study has demonstrated that Value add is volatile, unreliable and inappropriate for measuring teacher effectiveness. Their explanation conflates the use of value-add as a diagnostic tool and its use in evaluating teachers. Those are 2 very different use cases indeed.

As for it being mysterious, the formula used in Ohio is secret and proprietary - it doesn't get more mysterious than that! This claim by ODE is simply untrue and ridiculous, they ought to be embarrassed for publishing it. This myth is totally true and real and backed up by all the available scientific evidence.

7. Myth: The current process for evaluating teachers is fine just as it is.

Their explanation: "Last year, 99.7 percent of teachers around the country earned a “satisfactory” evaluation, yet many students didn’t make a year’s worth of progress in reading and are not reading at grade level." Right out of the corporate education reformers message book. Blame the teacher. Still think this isn't going to end up being about firing teachers? This myth is a straw-man, no one argues the current system is ideal, but the proposed OTES is dangerously constructed.

8. Myth: Most principals (or other evaluators) don’t have time to do this type of evaluation, so many will just report that teachers are proficient.

ODE states "Fact: Most principals are true professionals who want the teachers in their buildings to do well." But wait a minute, in Myth #7 these very same principals were handing out "satisfactory" grades like candy to 99.7% of teachers. Which is it? Are they professionals who can fairly evaluate teachers, or aren't they? We wrote about the massive administrative task faced by school administrators almost 2 years ago. Nothing has happened to alleviate those burdens, other than a $2 billion budget cut. This myth is 100% true.

9. Myth: This new evaluation system is like building the plane while we’re flying it.

ODE states: "Fact: Just as the Wright brothers built a plane, tried it by flying it, landed it, and then refined the plane they built, the new evaluation system was built, tried and revised. "

We'll just point out that 110 years have passed since the Wright Brothers first flew and the world has developed better design and project management tools since then.

10. Myth: It will be easy to implement the new teacher evaluation system.

Has anyone, anywhere said this? Or did the ODE brainstorming session run out of bad ideas at 9, and this is all they could come up with? Talk about ending with a straw-man, which frankly, given the rest of the document is probably the most appropriate ending.

ODE ought to withdraw this piece of propaganda from public view.

On the Issue of Value-add

January 29, 2013 in Article

There is a growing body of research demonstrating that "Value-Added" measures (VAM) is simply unreliable as a stand-alone measure of teacher effectiveness. When the legislature inserted language into HB 555 with no hearings or public input (or news coverage, for that matter) to eliminate the possibility of using multiple measures of student performance for teachers with value-added scores, it moved in a direction utterly lacking in scientific evidence. The new language calls on teachers to be evaluated based on a methodology that, by its very design, cannot measure the true quality of the interaction between teacher and students in the classroom. This has serious implications for students and teachers alike.

The Governor is advocating for expanded use of student test scores not only for teacher evaluation, but also for decisions involving teacher hiring, layoffs and pay. There simply is no credible expert testimony that supports such a move. Value-added measures are influenced by far too many variables beyond the control of the teacher to be used in such high-stakes decisions.

In other parts of the country where similar evaluation systems have been implemented, stories of great teachers who were branded as ineffective because of aberrations in student test data abound. (See, for example, the story of New York City 8th grade math teacher Carolyn Abbott or Washington, DC, 5th grade teacher Sarah Wysocki.) This isn't just a theoretical policy debate. Decisions made by our elected officials have real human consequences.

What follows is an accurate summation of the current scientific knwoledge of the use of VAM in evaluations.

Value Added in Evaluation

Many policy makers are enthusiastic about using value added measures (VAM) for teacher evaluation. Many states have incorporated it into teacher evaluations. Its use, however, is problematic due to concerns about accuracy, fairness, and the incentives it would create for teachers that are potentially harmful for students.

VAM has serious limitations in determining teacher effectiveness

A teacher can be ranked in the top quartile one year and sink to the middle or even the bottom the next independent of any changes they made in their own instructional practice.

A paper written for the Carnegie Knowledge Network examining this issue cited a study that found that half of the teachers in the top fifth of performance remained there the following year while 20% of them fell to the lowest two quintiles. This defies reason – how could one fifth of teachers be identified as top performers in one year but among the worst in the next?

There are many reasons for this: VAM doesn’t account for school effect, students don’t grow at the linear pace assumed by the models, the students aren’t randomly assigned and VAM seems to be worse for teachers of students who have limited English proficiency. According to a RAND corporation study, VAM scores varied depending on what test was used.

Many Researchers Caution Against Use of VAM in Teacher Evaluations as a Sole Measure

The Brookings Institute supports use of VAM but cautions that the error ranges in measurement are so wide that one can’t make precise differentiation between levels of teacher effectiveness. The RAND study mentioned above also made a similar recommendation.

Jesse Rothstein of UC Berkeley found that non-random assignment of students caused the model to demonstrate a teacher caused student growth in the year prior to having them as students.

A synthesis of available research conducted by Marzano found that teachers account for only about 13 percent of the variance in student achievement.

Student variables (including home environment, student motivation, and prior knowledge) account for 80 percent of the variance. VAM does not necessarily isolate the teacher’s contribution to student achievement growth.

Erik Hanushek, whom the Ohio General Assembly relies on for policy advice, also gives caution to the over-reliance on value added for high stakes decisions with respect to teachers:

“The bigger set of issues, however, relates to the use of teacher value-added estimates in compensation, employment, promotion, or assignment decisions. The possibility of introducing performance pay based on value-added estimates motivates much of the prior analysis of the properties of these estimates, but movement in this direction has so far been limited.” “Despite the strength of the research findings, concerns about accuracy, fairness, and potential adverse effects of incentives based on a limited set of outcomes raise worries about the use of value added estimates in education personnel and policy decisions. Many of the possible drawbacks are related to the measurement and estimation issues discussed above, but there are also concerns about incentives to cheat, adopt teaching methods that teach narrowly to tests, and ignore non-tested subjects.”

And…

“Although researchers can mitigate the effects of sampling error on estimates of teacher quality, such error would inevitably lead some successful teachers to receive low ratings and some unsuccessful teachers to receive high ratings.”

And, finally, it may have an adverse effect on students:

“In terms of fairness, any failure to account for sorting on unobservable characteristics would potentially penalize teachers given … more difficult classrooms and reward teachers given … less difficult classrooms. This could discourage educationally beneficial decisions including the assignment of more difficult or disruptive students to higher quality teachers.”

Hanushek recommends that these problems could be mitigated by combining value-added with subjective observations. Hanushek’s paper may be found here.

HB 555 Magnifies the Problematic Nature of Over-reliance on VAM to Evaluate Teachers

HB 156 and SB 316 set forth the framework for the Ohio Teacher Evaluation System (OTES) in requiring that student achievement growth accounts for 50% of a teacher’s evaluation. The law mandated that VAM, when available, must be part of the student growth calculation but didn’t specify to what degree. The Ohio Department of Education, in creating the OTES framework mandated that student growth be calculated using multiple measures and that VAM, when available, must account for at least 10% of the the whole evaluation. Presumably ODE constructed the model in this way in recognition of the limitations of VAM as a primary determinant of teacher effectiveness.

HB 555 changes the framework to require that, if VAM is available for a teacher, it must be used in proportion to the amount they teach subjects covered by VAM in their schedule. In other words, a middle school math teacher who teaches an entire day of 7th and 8th grade math would have the 50% growth measure solely determined by VAM.

The OTES model has an imbedded bias to overvalue student growth. For instance, if a teacher with a poor student growth measure can be rated no greater than “Developing” (the second lowest category) no matter how their evaluator rated their classroom performance.(Fig below)

Because of the overvaluing of student growth in the OTES teacher rating matrix, HB 555 magnifies the random errors in VAM due to selection bias, non-school factors, the effect from other teachers and the school itself which are out of the teacher’s control. When VAM is fully 50% of a teacher’s evaluation and is overvalued so that it essentially trumps any teacher rating from subjective observations, the inevitable errors that occur will cause teachers to be unfairly rated in the lowest two categories putting them at risk for dismissal or being first in line to be laid off through reduction in force.

Simply put, we don’t believe that teachers should have an element of randomness determine career risk.

Using VAM to De-Select Teachers May Have Adverse School and Labor Market Effects

If teachers believe that their VAM score can cause them to lose their jobs, they will be much more likely to hoard information and teaching methods from their colleagues. They will also resist assignment of difficult students to their class, believing that the very students who need the most help may cause them to suffer adverse career consequences.

If teachers are being asked to assume a greater amount of career risk without a commensurate rise in pay, it is less than clear that there will be a willing pool of candidates waiting to fill positions of deselected teachers. This is especially problematic in the mathematics field, where there are already shortages of willing and qualified candidates. This situation will likely be exacerbated if teachers believe that the evaluation system is inherently unfair.

There will likely be an adverse effect on students as well. Schools and teachers will choose to narrow the curriculum and in-class instruction to only that which will be tested. Such narrowing of the curriculum will strip away the enjoyable aspects of school from students’ lives.

Alternatives to the Current System

This is not to say that there is no place for VAM in a comprehensive teacher evaluation system. There are alternatives to the current system in which VAM is a prominent part of the teacher evaluation but not a primary determinant of quality and leaving sufficient margin of error.

Several states have a student growth component that is lower than 50% - DC’s impact system (the prototypical model for OTES) has recently been revised to de-escalate the role of VAM in response to the concerns about its accuracy.

Teacher resistance to VAM is not monolithic – it’s much less likely they will resist OTES if VAM were a much lower component than the current mandated level. Furthermore, there is evidence that multiple observations and VAM can work in concert to successfully identify top performers as well as laggards.

Policy Recommendations

Reverse the VAM requirement put forth in HB555
Reduce the overall proportion of student growth required in the teacher evaluation
Maintain flexibility to refine the evaluation system as needed – this is mostly new and unproven
Systematically solicit and incorporate large scale teacher input – efforts in this area have been at best inadequate

Some Value Added Research Resources from ASCD:

Using Value-Added Measures to Evaluate Teachers

Use Caution with Value-Added Measures

How Do Value-Added Indicators Compare to Other Measures of Teacher Effectiveness?

November 30, 2012 in Article

Via

Highlights

Value-added measures are positively related to almost all other commonly accepted measures of teacher performance such as principal evaluations and classroom observations.
While policymakers should consider the validity and reliability of all their measures, we know more about value-added than others.
The correlations appear fairly weak, but this is due primarily to lack of reliability in essentially all measures.
The measures should yield different performance results because they are trying to measure different aspects of teaching, but they differ also because all have problems with validity and reliability.
Using multiple measures can increase reliability; validity is also improved so long as the additional measures capture aspects of teaching we value.
Once we have two or three performance measures, the costs of more measures for accountability may not be justified. But additional formative assessments of teachers may still be worthwhile to help these teachers improve.

Introduction

In the recent drive to revamp teacher evaluation and accountability, measures of a teacher’s value added have played the starring role. But the star of the show is not always the best actor, nor can the star succeed without a strong supporting cast. In assessing teacher performance, observations of classroom practice, portfolios of teachers’ work, student learning objectives, and surveys of students are all possible additions to the mix.

All these measures vary in what aspect of teacher performance they measure. While teaching is broadly intended to help students live fulfilling lives, we must be more specific about the elements of performance that contribute to that goal – differentiating contributions to academic skills, for instance, from those that develop social skills. Once we have established what aspect of teaching we intend to capture, the measures differ in how valid and reliable they are in capturing that aspect.

Although there are big holes in what we know about how evaluation measures stack up on these two criteria, we can draw some important conclusions from the evidence collected so far. In this brief, we will show how existing research can help district and state leaders who are thinking about using multiple measures of teacher performance to guide them in hiring, development, and retention.

[readon2 url="http://www.carnegieknowledgenetwork.org/briefs/value-added/value-added-other-measures/"]Continue reading...[/readon2]

How Stable are Value-Added Estimates

November 28, 2012 in Article

Via

Highlights:

A teacher’s value-added score in one year is partially but not fully predictive of her performance in the next.
Value-added is unstable because true teacher performance varies and because value-added measures are subject to error.
Two years of data does a meaningfully better job at predicting value added than does just one. A teacher’s value added in one subject is only partially predictive of her value added in another, and a teacher’s value added for one group of students is only partially predictive of her valued added for others.
The variation of a teacher’s value added across time, subject, and student population depends in part on the model with which it is measured and the source of the data that is used.
Year-to-year instability suggests caution when using value-added measures to make decisions for which there are no mechanisms for re-evaluation and no other sources of information.

Introduction

Value-added models measure teacher performance by the test score gains of their students, adjusted for a variety factors such as the performance of students when they enter the class. The measures are based on desired student outcomes such as math and reading scores, but they have a number of potential drawbacks. One of them is the inconsistency in estimates for the same teacher when value added is measured in a different year, or for different subjects, or for different groups of students.

Some of the differences in value added from year to year result from true differences in a teacher’s performance. Differences can also arise from classroom peer effects; the students themselves contribute to the quality of classroom life, and this contribution changes from year to year. Other differences come from the tests on which the value-added measures are based; because test scores are not perfectly accurate measures of student knowledge, it follows that they are not perfectly accurate gauges of teacher performance.

In this brief, we describe how value-added measures for individual teachers vary across time, subject, and student populations. We discuss how additional research could help educators use these measures more effectively, and we pose new questions, the answers to which depend not on empirical investigation but on human judgment. Finally, we consider how the current body of knowledge, and the gaps in that knowledge, can guide decisions about how to use value-added measures in evaluations of teacher effectiveness.

[readon2 url="http://www.carnegieknowledgenetwork.org/briefs/value-added/value-added-stability/"]Continue reading...[/readon2]

Do Value-Added Methods Level the Playing Field for Teachers?

November 27, 2012 in Article

Via

Highlights

Value-added measures partially level the playing field by controlling for many student characteristics. But if they don't fully adjust for all the factors that influence achievement and that consistently differ among classrooms, they may be distorted, or confounded (An estimate of a teacher’s effect is said to be confounded when her contribution cannot be separated from other factors outside of her control, namely the the students in her classroom.)
Simple value-added models that control for just a few tests scores (or only one score) and no other variables produce measures that underestimate teachers with low-achieving students and overestimate teachers with high-achieving students.
The evidence, while inconclusive, generally suggests that confounding is weak. But it would not be prudent to conclude that confounding is not a problem for all teachers. In particular, the evidence on comparing teachers across schools is limited.
Studies assess general patterns of confounding. They do not examine confounding for individual teachers, and they can't rule out the possibility that some teachers consistently teach students who are distinct enough to cause confounding.
Value-added models often control for variables such as average prior achievement for a classroom or school, but this practice could introduce errors into value-added estimates.
Confounding might lead school systems to draw erroneous conclusions about their teachers – conclusions that carry heavy costs to both teachers and society.

Introduction

Value-added models have caught the interest of policymakers because, unlike using student tests scores for other means of accountability, they purport to "level the playing field." That is, they supposedly reflect only a teacher's effectiveness, not whether she teaches high- or low-income students, for instance, or students in accelerated or standard classes. Yet many people are concerned that teacher effects from value-added measures will be sensitive to the characteristics of her students. More specifically, they believe that teachers of low-income, minority, or special education students will have lower value-added scores than equally effective teachers who are teaching students outside these populations. Other people worry that the opposite might be true - that some value-added models might cause teachers of low-income, minority, or special education students to have higher value-added scores than equally effective teachers who work with higher-achieving, less risky populations.

In this brief, we discuss what is and is not known about how well value-added measures level the playing field for teachers by controlling for student characteristics. We first discuss the results of empirical explorations. We then address outstanding questions and the challenges to answering them with empirical data. Finally, we discuss the implications of these findings for teacher evaluations and the actions that may be based on them.

[readon2 url="http://www.carnegieknowledgenetwork.org/briefs/value-added/level-playing-field/"]Continue reading...[/readon2]