This paper contains a sizable collection of testing errors made in the last twenty-five years. It thus offers testimony to counter the implausible demands of educational policy makers for a single, error-free, accurate, and valid test used with large groups of children for purposes of sorting, selection, and trend-tracking.
No company can offer flawless products. Even highly reputable testing contractors that offer customers high-quality products and services produce tests that are susceptible to error. But while a patient dissatisfied with a diagnosis or treatment may seek a second or third opinion, for a child in a New York City school (and in dozens of other states and hundreds of other cities and towns), there is only one opinion that counts – a single test score. If that is in error, a long time may elapse before the mistake is brought to light – if it ever is.
This paper has shown that human error can be, and often is, present in all phases of the testing process. Error can creep into the development of items. It can be made in the setting of a passing score. It can occur in the establishment of norming groups, and it is sometimes found in the scoring of questions.
[…]
Measuring trends in achievement is an area of assessment laden with complications. The documented struggles of the National Center for Education Statistics (NCES) and Harcourt Educational Measurement testify to the complexity inherent in measuring changes in achievement. Perhaps such measurement requires an assessment program dedicated solely to that purpose. NCES carefully avoids even small changes to the NAEP tests and examines the impact of each change on the test's accuracy. Many state DOEs, however, use a single test to measure both individual student achievement and aggregate changes in achievement scores – a test that often contains very different questions from administration to administration. This practice runs counter to the hard-learned lesson offered by Beaton: "If you want to measure change, do not change the measure" (Beaton et al., 1990, p. 165).
Furthermore, while it is generally held that consumers should heed the advice of product developers (as is done when installing an infant car seat or taking medication), the advice of test developers and contractors often goes unheeded in the realm of high-stakes decision-making. The presidents of two major test developers – Harcourt Brace and CTB/McGraw-Hill – went on record stating that their tests should not be used as the sole criterion for high-stakes educational decisions (Myers, 2001; Mathews, 2000a). Yet more than half of the state DOEs use test results as the basis for important decisions that these tests, perhaps, were not designed to support.
Finally, all of these concerns should be viewed in the context of the testing industry today. Lines (2000) observed that errors are more likely in testing programs with greater degrees of centralization and commercialization, where increased profits can be realized only by increasing market share: "The few producers cannot compete on price, because any price fall will be instantly matched by others .... What competition there is comes through marketing" (p. 1). In Minnesota, Judge Oleisky (Kurvers et al. v. NCS, Inc., 2002) found that Basic Skills Test errors were caused by NCS's drive to cut costs and raise profits by delivering substandard service – demonstrating that profits may also be increased through methods other than marketing.