reliability

How Do Value-Added Indicators Compare to Other Measures of Teacher Effectiveness?

Highlights

  • Value-added measures are positively related to almost all other commonly accepted measures of teacher performance such as principal evaluations and classroom observations.
  • While policymakers should consider the validity and reliability of all their measures, we know more about value-added than others.
  • The correlations appear fairly weak, but this is due primarily to lack of reliability in essentially all measures.
  • The measures should yield different performance results because they are trying to measure different aspects of teaching, but they differ also because all have problems with validity and reliability.
  • Using multiple measures can increase reliability; validity is also improved so long as the additional measures capture aspects of teaching we value.
  • Once we have two or three performance measures, the costs of more measures for accountability may not be justified. But additional formative assessments of teachers may still be worthwhile to help these teachers improve.
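The third highlight rests on a standard psychometric result: measurement error shrinks the correlation we observe between two measures. Below is a minimal sketch, assuming classical test theory (errors uncorrelated with each other and with true performance). The function names and the input numbers are illustrative assumptions, not figures from the brief.

```python
import math


def attenuated_correlation(true_r: float, rel_x: float, rel_y: float) -> float:
    """Correlation we expect to observe between two noisy measures.

    Classical attenuation formula: r_observed = r_true * sqrt(rel_x * rel_y),
    where rel_x and rel_y are the reliabilities of the two measures.
    """
    for rel in (rel_x, rel_y):
        if not 0.0 < rel <= 1.0:
            raise ValueError("reliability must be in (0, 1]")
    return true_r * math.sqrt(rel_x * rel_y)


def disattenuated_correlation(observed_r: float, rel_x: float, rel_y: float) -> float:
    """Invert the formula: estimate the correlation between error-free scores."""
    for rel in (rel_x, rel_y):
        if not 0.0 < rel <= 1.0:
            raise ValueError("reliability must be in (0, 1]")
    return observed_r / math.sqrt(rel_x * rel_y)


# Hypothetical inputs: a strong underlying relationship (0.8) between
# value-added and observation scores, each measured with low reliability.
observed = attenuated_correlation(true_r=0.8, rel_x=0.4, rel_y=0.5)
print(round(observed, 2))  # 0.36
```

Under these made-up inputs, two measures that agree strongly about the underlying trait would still show only a modest observed correlation, which is the pattern the highlight describes.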

Introduction

In the recent drive to revamp teacher evaluation and accountability, measures of a teacher’s value added have played the starring role. But the star of the show is not always the best actor, nor can the star succeed without a strong supporting cast. In assessing teacher performance, observations of classroom practice, portfolios of teachers’ work, student learning objectives, and surveys of students are all possible additions to the mix.

All these measures vary in what aspect of teacher performance they measure. While teaching is broadly intended to help students live fulfilling lives, we must be more specific about the elements of performance that contribute to that goal – differentiating contributions to academic skills, for instance, from those that develop social skills. Once we have established what aspect of teaching we intend to capture, the measures differ in how valid and reliable they are in capturing that aspect.

Although there are big holes in what we know about how evaluation measures stack up on these two criteria, we can draw some important conclusions from the evidence collected so far. In this brief, we will show how existing research can help district and state leaders who are thinking about using multiple measures of teacher performance to guide them in hiring, development, and retention.

[readon2 url="http://www.carnegieknowledgenetwork.org/briefs/value-added/value-added-other-measures/"]Continue reading...[/readon2]

Value-Added Versus Observations

Value-Added Versus Observations, Part One: Reliability

Although most new teacher evaluations are still in various phases of pre-implementation, it’s safe to say that classroom observations and/or value-added (VA) scores will be the most heavily weighted components of teachers’ final scores, depending on whether teachers are in tested grades and subjects. One gets the general sense that many – perhaps most – teachers strongly prefer the former (observations, especially peer observations) over the latter (VA).

One of the most common arguments against VA is that the scores are error-prone and unstable over time – i.e., that they are unreliable. And it’s true that the scores fluctuate between years, with much of this instability due to measurement error, rather than “real” performance changes. On a related note, different model specifications and different tests can yield very different results for the same teacher/class.

These findings are very important, and often too casually dismissed by VA supporters, but the issue of reliability is, to varying degrees, endemic to all performance measurement. Actually, many of the standard reliability-based criticisms of value-added could also be leveled against observations. Since we cannot observe “true” teacher performance, it’s tough to say which is “better” or “worse,” despite the certainty with which both “sides” often present their respective cases. And, the fact that both entail some level of measurement error doesn’t by itself speak to whether they should be part of evaluations.

Nevertheless, many states and districts have already made the choice to use both measures, and in these places, the existence of imprecision is less important than how to deal with it. Viewed from this perspective, VA and observations are in many respects more alike than different.

[readon2 url="http://shankerblog.org/?p=5621"]Continue reading part I[/readon2]

Value-Added Versus Observations, Part Two: Validity

In a previous post, I compared value-added (VA) and classroom observations in terms of reliability – the degree to which they are free of error and stable over repeated measurements. But even the most reliable measures aren’t useful unless they are valid – that is, unless they’re measuring what we want them to measure.

Arguments over the validity of teacher performance measures, especially value-added, dominate our discourse on evaluations. There are, in my view, three interrelated issues to keep in mind when discussing the validity of VA and observations. The first is definitional – in a research context, validity is less about a measure itself than the inferences one draws from it. The second point might follow from the first: The validity of VA and observations should be assessed in the context of how they’re being used.

Third and finally, given the difficulties in determining whether either measure is valid in and of itself, as well as the fact that so many states and districts are already moving ahead with new systems, the best approach at this point may be to judge validity in terms of whether the evaluations are improving outcomes. And, unfortunately, there is little indication that this is happening in most places.

Let’s start by quickly defining what is usually meant by validity. Put simply, whereas reliability is about the precision of the answers, validity addresses whether we’re using them to answer the correct questions. For example, a person’s weight is a reliable measure, but this doesn’t necessarily mean it’s valid for gauging the risk of heart disease. Similarly, in the context of VA and observations, the question is: Are these indicators, even if they can be precisely estimated (i.e., they are reliable), measuring teacher performance in a manner that is meaningful for student learning?

[readon2 url="http://shankerblog.org/?p=5670"]Continue reading part II[/readon2]

New Gates Study on Teacher Evaluations

A new Gates Foundation study released today finds that effective teacher evaluations require high standards and multiple measures.

ABOUT THIS REPORT: This report is intended for policymakers and practitioners wanting to understand the implications of the Measures of Effective Teaching (MET) project’s interim analysis of classroom observations. Those wanting to explore all the technical aspects of the study and analysis also should read the companion research report, available at www.metproject.org.

Together, these two documents on classroom observations represent the second pair of publications from the MET project. In December 2010, the project released its initial analysis of measures of student perceptions and student achievement in Learning about Teaching: Initial Findings from the Measures of Effective Teaching Project. Two more reports are planned for mid-2012: one on the implications of assigning weights to different measures; another using random assignment to study the extent to which student assignment may affect teacher effectiveness results.

ABOUT THE MET PROJECT: The MET project is a research partnership of academics, teachers, and education organizations committed to investigating better ways to identify and develop effective teaching. Funding is provided by the Bill & Melinda Gates Foundation.

The report offers three takeaways.

High-quality classroom observations will require clear standards, certified raters, and multiple observations per teacher. Clear standards and high-quality training and certification of observers are fundamental to increasing inter-rater reliability. However, when measuring consistent aspects of a teacher’s practice, reliability will require more than inter-rater agreement on a single lesson. Because teaching practice varies from lesson to lesson, multiple observations will be necessary when high-stakes decisions are to be made. But how will school systems know when they have implemented a fair system? Ultimately, the most direct way is to periodically audit a representative sample of official observations, by having impartial observers perform additional observations. In our companion research report, we describe one approach to doing this.
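One way to see why multiple observations help is the Spearman-Brown formula, a textbook result for the reliability of an average of repeated measurements. The sketch below is illustrative only: it assumes the observations are "parallel" (equally reliable, with independent errors), and the single-lesson reliability of 0.35 is a hypothetical number, not a figure from the MET report.

```python
def reliability_of_average(single_reliability: float, n_observations: int) -> float:
    """Spearman-Brown: reliability of the mean of n parallel observations."""
    if not 0.0 <= single_reliability <= 1.0:
        raise ValueError("single_reliability must be in [0, 1]")
    if n_observations < 1:
        raise ValueError("n_observations must be at least 1")
    r = single_reliability
    return n_observations * r / (1 + (n_observations - 1) * r)


def observations_needed(single_reliability: float, target: float) -> int:
    """Smallest number of observations whose average reaches the target."""
    if not 0.0 < single_reliability < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("both reliabilities must be strictly between 0 and 1")
    n = 1
    while reliability_of_average(single_reliability, n) < target:
        n += 1
    return n


# Hypothetical single-lesson reliability of 0.35.
for n in (1, 2, 4):
    print(n, round(reliability_of_average(0.35, n), 2))
# 1 0.35
# 2 0.52
# 4 0.68
print(observations_needed(0.35, target=0.65))  # 4
```

Real observations are unlikely to be perfectly parallel (the same rater or the same class period introduces shared error), so the gains in practice may be smaller than this formula suggests.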

Combining the three approaches (classroom observations, student feedback, and value-added student achievement gains) capitalizes on their strengths and offsets their weaknesses. For example, value-added is the best single predictor of a teacher’s student achievement gains in the future. But value-added is often not as reliable as some other measures and it does not point a teacher to specific areas needing improvement. Classroom observations provide a wealth of information that could support teachers in improving their practice. But, by themselves, these measures are not highly reliable, and they are only modestly related to student achievement gains. Student feedback promises greater reliability because it includes many more perspectives based on many more hours in the classroom, but not surprisingly, it is not as predictive of a teacher’s achievement gains with other students as value-added. Each shines in its own way, either in terms of predictive power, reliability, or diagnostic usefulness.
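The reliability side of this argument can be illustrated with a small simulation. This is a sketch under strong assumptions, not the MET project's method: it treats all three measures as noisy readings of one underlying "true quality" with independent errors, uses equal weights, and picks arbitrary noise levels. The names `simulate` and `noise_sds` are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(n_teachers=5000, noise_sds=(1.5, 1.2, 2.0), seed=0):
    """Compare each noisy measure, and their equal-weight average, to the truth.

    noise_sds are made-up error standard deviations standing in for
    observations, student surveys, and value-added.
    """
    rng = random.Random(seed)
    true_quality = [rng.gauss(0, 1) for _ in range(n_teachers)]
    measures = [
        [t + rng.gauss(0, sd) for t in true_quality] for sd in noise_sds
    ]
    # Equal-weight composite: average the three scores for each teacher.
    composite = [sum(scores) / len(scores) for scores in zip(*measures)]
    single_rs = [pearson(m, true_quality) for m in measures]
    return single_rs, pearson(composite, true_quality)


single_rs, composite_r = simulate()
print([round(r, 2) for r in single_rs], round(composite_r, 2))
```

In this setup the composite tracks true quality more closely than any single measure because independent errors partly cancel when averaged. If the measures capture different aspects of teaching, as the brief above points out, the picture is more complicated, and the choice of weights matters.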

Combining new approaches to measuring effective teaching – while not perfect – significantly outperforms traditional measures. Providing better evidence should lead to better decisions. No measure is perfect. But if every personnel decision carries consequences – for teachers and students – then school systems should learn which measures are better aligned to the outcomes they value. Combining classroom observations with student feedback and student achievement gains on state tests did a better job than master’s degrees and years of experience in predicting which teachers would have large gains with another group of students. But the combined measure also predicted larger differences on a range of other outcomes, including more cognitively challenging assessments and student-reported effort and positive emotional attachment. We should refine these tools and continue to develop better ways to provide feedback to teachers. In the meantime, it makes sense to compare measures based on the criteria of predictive power, reliability, and diagnostic usefulness.

MET Gathering Feedback Practitioner Brief