Measuring Up: What Educational Testing Really Tells Us is a 2008 book by Daniel Koretz, the Henry Lee Shattuck Professor of Education at Harvard and a noted expert on educational assessment and testing policy. Professor Koretz is both an excellent writer and an excellent public speaker, as evidenced by his many short videos on BigThink and YouTube.
Measuring Up is an “accessible” book that sets out to teach the basic concepts, both statistical and testing-specific, of educational achievement tests, and mostly succeeds. A prime example of such a test is the SAT (the test formerly known as the Scholastic Aptitude Test), perhaps the best-known and most important standardized test in the United States. The book uses a minimum of mathematics and equations, relying on graphs and verbal descriptions instead. Some technical definitions, such as the precise definition of the standard deviation, are presented in footnotes. Measuring Up is based on a class with a similar goal that Professor Koretz gave at Harvard, aimed at master’s degree students who need a good understanding of educational testing but lack the time and perhaps inclination to master the arcane statistics and mathematical methods used in the testing field. In many respects, most of us, including parents, students, teachers, and public policy makers, are in the same boat.
Professor Koretz’s views seem mostly moderate, middle-of-the-road, for lack of a better label. He supports educational testing and “accountability,” but expresses considerable frustration with over-reliance on tests and test scores and has a number of highly critical things to say about “high-stakes” testing and President Bush’s controversial No Child Left Behind (NCLB) education reform and the “Texas Miracle” that preceded it. It may also be noted that President Obama’s Race to the Top education initiative is actually pretty similar to Bush’s program, perhaps reflecting the agenda and beliefs of the financiers and businessmen who fund both political parties.
I found the book highly informative, especially the comparison of educational tests to political polls which made me think of tests in a way I usually do not. The key point is that a test, especially most standardized tests like the SAT, is actually a tiny sample of a large domain of knowledge that the student is supposed to learn and ideally master. In the same way that a poll of a few thousand people can accurately predict the votes of hundreds of millions of voters in a national election, an educational test attempts to evaluate mastery of a sometimes vast topic based on a small number, perhaps forty to eighty, of questions selected from the domain. It also means that a test can be highly misleading in much the same way that some polls famously predicted that Thomas Dewey would defeat Harry Truman in the Presidential election in 1948.
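The polling analogy can be made concrete with the textbook standard error of a sample proportion. The sketch below is only an illustration, assuming (unrealistically) that test questions are independent random draws from the domain. The function name and the example numbers (a student who has mastered 70 percent of the domain, a poll of 2,000 voters in an evenly split race) are hypothetical and do not come from the book.

```python
import math

def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent draws."""
    return math.sqrt(p * (1.0 - p) / n)

# Hypothetical student who has mastered 70% of the domain.
for n_questions in (40, 80):
    se = standard_error(0.70, n_questions)
    print(f"{n_questions} questions: uncertainty of about +/- {100 * se:.1f} percentage points")

# Hypothetical poll of 2,000 voters in an evenly split race.
se_poll = standard_error(0.50, 2000)
print(f"2000 voters: uncertainty of about +/- {100 * se_poll:.1f} percentage points")
```

Under these assumptions, a forty-question test is a considerably noisier sample of its domain than a poll of a couple of thousand voters is of the electorate, and doubling the number of questions narrows the uncertainty only by a factor of about 1.4.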
In particular, standardized tests are susceptible to being “gamed” if the test-taker or the test-taker’s teachers, parents, or others know the specific questions or types of questions on the test — or simply cheat in some way. Professor Koretz makes a moderately convincing case that many “high-stakes” tests such as the Texas Assessment of Academic Skills (TAAS) during George W. Bush’s tenure as governor have been gamed in some way; he calls this “score inflation.” If a test is “high-stakes,” meaning that something important such as admission to a selective college (the SAT) or funding for a school (TAAS) depends on the outcome of the test, there is a strong incentive to game the test, for example by “teaching to the test” or even outright cheating.
Although I like the book and highly recommend it, I have serious reservations about some aspects of it. What follows is a discussion of these reservations, especially regarding the book’s treatment of the decline in SAT scores from 1963 to 1980. I also discuss some implications of the key points in Measuring Up for the now common practice of coding interviews in the computer industry, an extreme example of high-stakes testing.
The Mysterious Decline in SAT Scores
From 1963 until 1980, scores on the SAT declined significantly in the United States, especially on the verbal test. The SAT at this time consisted of two tests: a verbal test and a mathematics test. Scores were reported on a scale from 200 to 800. This scale had been established in 1941 and normalized to data from 1941 so that the mean for each test in 1941 was 500, with a standard deviation in the distribution of student scores of 100. This means that in 1941 about two-thirds of students scored between 400 and 600 on each test, and about ninety-five percent scored between 300 and 700. The SAT was designed to yield a normal distribution, the Bell Curve, for student scores.
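The two-thirds and ninety-five percent figures follow directly from the normal distribution. The sketch below checks them, assuming an idealized, exactly normal distribution with mean 500 and standard deviation 100; real SAT scores are discrete and capped at 200 and 800, and the function name is my own.

```python
import math

def fraction_within(low, high, mean=500.0, sd=100.0):
    """Fraction of a normal(mean, sd) distribution lying between low and high."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))
    return cdf(high) - cdf(low)

print(f"400 to 600: {fraction_within(400, 600):.1%}")
print(f"300 to 700: {fraction_within(300, 700):.1%}")
```

The results come out at roughly 68 percent and 95 percent, matching the rule of thumb quoted above.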
By 1981 the mean verbal SAT score had declined to 424 and the mean math score to 466. Both rebounded slightly during the 1980s. By 1995 the mean verbal SAT score had risen back to 428, hardly much of an improvement although statistically significant, and the mean math score to 482. In that year the scoring was rescaled so that 428 became the new 500 for the verbal SAT and 482 the new 500 for the math SAT, making historical comparisons more difficult for parents, students, and teachers with limited time to analyze the numbers. Everyone wins and everyone gets a prize 🙂
This decline was not limited to the SAT, as Professor Koretz discusses clearly in his book. In fact, most educational tests, such as the National Assessment of Educational Progress (NAEP), showed similar declines. The decline was widespread. It happened in both public and private schools and in Canada as well, at least suggesting something independent of US government policy. However, the SAT is the best-known and probably the most important standardized educational test for Americans, and the decline in SAT scores played a central role in the ensuing controversies.
In particular, political conservatives and education reformers, often critical of public schools and teachers’ unions, seized upon the decline in SAT scores. The decline played a central role in the Reagan administration’s A Nation at Risk report with its famous, widely quoted opening passage:
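To get a feel for the size of the decline, the 1981 means can be expressed in standard deviation units and as percentiles of the 1941 reference scale. This is a rough sketch, assuming the 1941 reference distribution was exactly normal with mean 500 and standard deviation 100; the function and constant names are my own.

```python
import math

MEAN_1941 = 500.0  # reference mean stated above
SD_1941 = 100.0    # reference standard deviation stated above

def position_in_1941(score):
    """Return (z-score, percentile) of a score against the 1941 reference,
    assuming that reference distribution is exactly normal."""
    z = (score - MEAN_1941) / SD_1941
    percentile = 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z, percentile

for label, mean_1981 in (("verbal", 424), ("math", 466)):
    z, pct = position_in_1941(mean_1981)
    print(f"1981 {label} mean {mean_1981}: {z:+.2f} SD, "
          f"about percentile {pct:.0f} on the 1941 scale")
```

On that idealized scale, the 1981 verbal mean sits about three-quarters of a standard deviation below the 1941 mean and the math mean about a third of a standard deviation below it.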
Our Nation is at risk. Our once unchallenged preeminence in commerce, industry, science, and technological innovation is being overtaken by competitors throughout the world. This report is concerned with only one of the many causes and dimensions of the problem, but it is the one that undergirds American prosperity, security, and civility. We report to the American people that while we can take justifiable pride in what our schools and colleges have historically accomplished and contributed to the United States and the well-being of its people, the educational foundations of our society are presently being eroded by a rising tide of mediocrity that threatens our very future as a Nation and a people. What was unimaginable a generation ago has begun to occur–others are matching and surpassing our educational attainments.
If an unfriendly foreign power had attempted to impose on America the mediocre educational performance that exists today, we might well have viewed it as an act of war. As it stands, we have allowed this to happen to ourselves. We have even squandered the gains in student achievement made in the wake of the Sputnik challenge. Moreover, we have dismantled essential support systems which helped make those gains possible. We have, in effect, been committing an act of unthinking, unilateral educational disarmament.
The decline in SAT and other scores from 1963 to 1980, and the relatively low scores since then when compared to 1941, is actually a very critical issue in both educational testing and educational policy in the United States. It is a touchy topic and I found the discussion of it in Measuring Up a bit confusing. Early in the book, Professor Koretz writes:
The effect of compositional changes can be exacerbated when test taking is voluntary, and the decline in SAT scores was worsened by a major compositional change: a large increase in the proportion of SAT-takers drawn from historically lower-scoring groups. As college attendance became more common, the proportion of high-school graduates electing to take admissions tests rose, and many of those newly added to the rolls were lower-scoring students. This was studied in considerable detail by the College Entrance Examination Board in the 1970s, and the research showed clearly that a sizable share of the drop in SAT scores was the result of this compositional change. Had the characteristics of the test-taking group remained constant, the decline would have been much smaller.
Daniel M Koretz. MEASURING UP (Kindle Locations 912-916). Kindle Edition.
Just to be clear, the liberal/teachers’ union explanation for the decline is that the SAT in 1941 was taken primarily by rich white kids, and by the 1960s the test was being taken by non-rich, often non-white kids as well. On this view there was no decline in teaching quality but rather an increase in opportunities, at least in part due to liberal reforms in the 1960s and 1970s. The technical term for this is “compositional changes.” Go Team Liberal! 🙂
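A compositional change can lower the overall average even when no group’s own average moves at all. The sketch below illustrates the arithmetic with two groups of test takers. All of the shares and group means are hypothetical numbers chosen for illustration; they are not taken from the book or from the College Board research.

```python
def overall_mean(groups):
    """Weighted mean score; groups is a list of (share_of_test_takers, group_mean)."""
    total_share = sum(share for share, _ in groups)
    return sum(share * mean for share, mean in groups) / total_share

# Hypothetical numbers, NOT from the book or the College Board report.
# Each group's own mean is identical in both years; only the mix changes.
earlier = [(0.80, 520.0), (0.20, 420.0)]
later = [(0.50, 520.0), (0.50, 420.0)]

print(f"earlier overall mean: {overall_mean(earlier):.0f}")
print(f"later overall mean:   {overall_mean(later):.0f}")
```

In this made-up case the overall mean falls by thirty points purely because the mix of test takers shifted. The hard empirical question is how much of the real decline this mechanism explains.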
The problem is what exactly is meant by “a sizable share of the drop in SAT scores”. Later, Professor Koretz writes:
The available evidence about specific hypothesized causes of the score trends is not sufficient to evaluate all of them, but it is adequate to rule out some of them and to estimate the size of the effects others might have had. The evidence suggests that a variety of both social and educational factors may have contributed to the trends but that no one factor can account for more than a modest share of the total. For example, by my estimate, changes in the demographic composition of the student population may have accounted for 10 or 20 percent of the decline and somewhat damped the subsequent increase in scores.
Daniel M Koretz. MEASURING UP (Kindle Locations 1400-1404). Kindle Edition.
These two passages certainly don’t seem consistent. Is 10 or 20 percent a “sizable share”? Most people probably mean a larger fraction when they use the term sizable share.
What is going on here? In the first passage, Professor Koretz is probably referring to a “blue ribbon” panel report produced for the College Board in 1977: On Further Examination: Report of the Advisory Panel on the Scholastic Aptitude Test Score Decline. This report actually has a lot of waffling in the fine print, but concludes:
Most, probably two-thirds to three-fourths, of the SAT score decline between 1963 and about 1970 was related to the “compositional” changes in the group of students taking this college entrance examination.
That was a period of major expansion in the number and proportion of students completing high school, resulting only in part from the post-World War II population wave, which came along then. The rest of the growth reflected the deliberate national undertaking during that period to expand and extend educational opportunity by reducing the high school drop-out rate, by trying to eliminate previous discrimination based on ethnicity or sex or family financial circumstance, and by opening college doors much wider.
BUT A FEW PARAGRAPHS LATER (waffle waffle):
From about 1970 on, the composition of the SAT-taking population has become comparatively more stabilized with respect to its economic, ethnic, and social background.
Yet the score decline continued and then accelerated; there were particularly sharp drops during the three-year period from 1972 to 1975. Only about a quarter of the decline since 1970 can be attributed to continuing change in the make-up of the test-taking group. With a handful of exceptions, the drop in scores in recent years has been virtually across the board, affecting high-scoring and lower-scoring groups alike.
Is a quarter (of the decline since 1970) a “sizable share”? In common usage, a sizable share tends to imply at least a half.
I think the actual state of affairs is that we don’t know beyond a reasonable doubt what caused the decline since 1941 and in fact modern SAT scores, properly scaled and adjusted for changes in the SAT tests, are on average lower than in 1941. As Professor Koretz writes later in the book, compositional effects probably contributed but something else must have happened as well. I tend to think Professor Koretz is tap-dancing around this because it has the potential to offend many interested parties. Compositional effects let parents, teachers, students, school administrators, politicians, the College Board, almost everybody off the hook.
Professor Koretz repeatedly makes the point that there is a strong incentive to game “high-stakes” tests such as the SAT in various ways. He cites a number of studies, including some of his own studies, that show evidence that this has happened. There is a pretty good case that the rescaling of scores on the SAT in 1994/1995 is an example of this. There are certainly legitimate statistical reasons for the rescaling, but it clearly has the effect of hiding the long term decline from cursory examination by busy parents, teachers, and students.
How Significant was the Decline in SAT Scores in the Real World?
Not very. History has spoken. The Berlin Wall fell in 1989. Soviet troops pulled out of Eastern Europe, Afghanistan, and other regions. The Cold War ended, although it seems to be making a comeback lately.
A big concern in the 1980s and early 1990s was Japan, whose students consistently outperform US students on average in comparisons of math and other educational performance measures. The menace of Japan appeared in popular culture, movies, best sellers such as Karel van Wolferen’s The Enigma of Japanese Power (I have a copy), and many other venues. Japan, superior test scores notwithstanding, faltered, experienced a financial crash, suffered economic stagnation, and is rarely cited as a concern today.
Competency versus Ranking
In the book, Professor Koretz exhibits a strong preference for tests used to compare individual students or groups, in particular tests like the SAT that are specifically designed to produce a normal distribution of scores (the Bell Curve), over simple pass/fail tests like a driver’s license exam that are used to evaluate competency.
Competency and ranking tests are quite different. Professor Koretz gives a good example of the difference at the start of the book. He discusses a simple vocabulary test of forty words. We can choose the vocabulary words to be common words such as bed, travel, and carpet that anyone who knows English should know. In this case, most test takers, if they are competent English speakers, will get every or nearly every question right. If someone scored less than ninety percent right on a test like this, we would rightly fail them and conclude they probably lack basic competence in English. A competency test often won’t have a normal distribution (Bell Curve) of scores.
Professor Koretz doesn’t like tests like this. He also does not like tests with obscure words like silliculose, vilipend, and epimysium that almost everyone will fail. Rather he likes vocabulary tests with words like feckless, disparage, and minuscule that some test takers will know, others will not, and that often produce a normal distribution, Bell Curve, of scores. He likes tests that enable us to compare individual students (Johnny has a larger vocabulary than Bobby) or groups (Harvard students have a larger vocabulary than Texas A&M students perhaps).
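The effect of word difficulty on the shape of the score distribution can be sketched with a simple binomial model. This is only a toy illustration: it assumes every test taker answers each of the forty words correctly, independently, with the same probability, which real test takers do not. The per-word probabilities below are hypothetical stand-ins for the three kinds of word list, and the names are my own.

```python
import math

def prob_at_least(k_min, n, p):
    """P(score >= k_min) when each of n items is answered correctly
    independently with probability p (a simple binomial model)."""
    return sum(math.comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(k_min, n + 1))

N_WORDS = 40
PASS_MARK = 36  # ninety percent of forty words

# Hypothetical per-word success probabilities for the three kinds of word list.
word_lists = (("common words", 0.97),
              ("mid-difficulty words", 0.50),
              ("obscure words", 0.03))

for label, p in word_lists:
    spread = math.sqrt(N_WORDS * p * (1.0 - p))  # binomial standard deviation
    print(f"{label}: P(score >= {PASS_MARK}) = "
          f"{prob_at_least(PASS_MARK, N_WORDS, p):.3g}, score SD = {spread:.2f}")
```

In this toy model the common-word test piles nearly everyone up against the ceiling, the obscure-word test piles everyone up against the floor, and only the mid-difficulty list spreads scores out enough to rank people.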
The problem with this emphasis on ranking and comparison is that education does not aim only to identify the best students or groups of students. In mathematics, most students need to learn to balance their checkbook, evaluate prices in a store, formulate and track a personal or family budget, and evaluate confusing statistics about medical products or school performance for their kids. 🙂 For these everyday activities most people, who are not professional mathematicians or something similar, need to be competent, and they need tests that tell them and their teachers whether they are competent, not whether they are the best or better or worse than other people. This is what a driver’s exam is for. Most people would be nonplussed to receive an SAT-like score of 429 on a driver’s exam; what does that mean? We want to know whether someone can safely drive a car. Not only is the score reported differently (pass/fail) but the test is designed differently.
Competency is quite important. For example, if American farmers were incompetent, the United States would starve. If on the other hand, American farmers are competent but perhaps on average not quite as good as farmers in Japan, well — not really a big problem.
If American scientists and engineers were incompetent, indeed the nation would have been at risk in the 1980s. But if American scientists and engineers were on average somewhat less good than Japanese scientists or engineers, or than American scientists and engineers in 1941, that is not ideal but not really a big problem. In both cases, this inference about the relative quality of scientists and engineers is a big jump from differences in the test scores of K-12 students.
People from hyper-competitive environments like Harvard or Microsoft tend to confuse competency assessment and ranking. I mention Microsoft because of Bill Gates’s extensive activities in education and educational testing.
Case in point:
The Clueless CEO
Professor Koretz expresses exasperation with the attitude toward testing of some government officials and CEOs involved in education reform:
Early in his first term as president, George W. Bush, one of whose signature programs, No Child Left Behind, is built around testing, declared, “A reading comprehension test is a reading comprehension test. And a math test in the fourth grade, there’s not many ways you can foul up a test … It’s pretty easy to ‘norm’ the results.” Whatever one thinks of No Child Left Behind (and there are good arguments both for and against various aspects of it), this claim was entirely wrong: it is all too easy to foul up the design of a test, and it is even easy to foul up in interpreting test scores.
And Bush is hardly alone in this mistaken view. A few years ago, a representative of a prominent business group addressed a meeting of the Board on Testing and Assessment of the National Research Council, of which I was then a member. She complained that her bosses, some of the most prominent CEOs in America engaged in education reform, were exasperated because we in the measurement profession kept giving them far more complicated answers than they wanted. I responded that we gave them complex answers because the answers are in fact complex. One of her bosses had been the CEO of a computer company in which I then owned some stock, and I pointed out that my retirement savings would have taken a beating if that particular CEO had been foolish enough to demand only simple answers when his staff confronted him with problems of chip architecture or software design. She did not appear persuaded.
Daniel M Koretz. MEASURING UP (Kindle Locations 60-69). Kindle Edition.
Actually, from my own experience, it is quite conceivable that the CEO was foolish enough to demand only simple answers when his staff confronted him with problems of chip architecture or software design. 🙂 In defense of this assertion, it may be pointed out that even successful high tech companies have a high failure rate. Many don’t stay successful that long. Witness, for example, BlackBerry, once the king of the smartphone market.
The computer industry is the land of the sixty second elevator pitch. High tech startup companies are routinely expected to present their “pitch” in a ten minute, ten slide PowerPoint “slide deck.” If you can’t make your point convincingly in the forty character subject line of an e-mail read on an iPhone, you aren’t executive material.
In defense of the CEO, running even a small high tech business is a ton of work. The CEO must handle hundreds of issues simultaneously: sales, marketing, finance, interpersonal squabbles, numerous technical issues and so on. The CEO is usually the public face of the company and must spend an enormous amount of time on the road meeting investors and potential investors, major customers and potential customers, attending key trade shows and conferences, and many other public relations activities. Most good CEOs try to limit the amount of work they have to deal with directly and to select subordinates who can either deal with the issues or simplify them to PowerPoint bullets so the CEO can make a simple, quick decision — one of hundreds or even thousands the CEO must make. A simple, reliable metric like a standardized test score can be a Godsend to a harried executive.
Unfortunately, as Professor Koretz notes, some issues just can’t be simplified to a single metric or a few bullet points on a PowerPoint slide. They are inherently complex. The seductive appeal of a simple but wrong answer remains.
CEOs of computer companies don’t just promote high-stakes testing for school teachers and students. They are eating their own dog food. The current fad in job interviews in the computer industry is the “coding interview.” Coding interviews come in two main flavors. One is a grueling several hour interview answering algorithm and coding questions drawn from the subject area of the job. The other flavor is a grueling several hour interview answering algorithm and coding questions drawn from algorithms courses and books in the computer science (CS) curriculum that almost no one actually uses in the real world. Google, which has a reputation for hiring most employees right out of school, is noted for this latter kind of coding interview.
In software design orthodoxy, software engineers are supposed to reuse code — not reinvent the wheel. Most of the algorithms taught in CS classes at colleges and universities were figured out long ago and are incorporated in widely available libraries and programming languages. Consequently, practicing software engineers rarely need to know how to implement these algorithms. In fact, software engineers developing or implementing cutting edge algorithms generally spend their time working on algorithms not taught in school. Surprise, surprise.
A substantial proportion of actual software engineers are not formally trained in computer science. For example, many have degrees in other STEM (Science, Technology, Engineering, and Math) fields. A fair number are self-taught or college, even high school dropouts, including such noted figures as Bill Gates of Microsoft and Jan Koum of WhatsApp. Members of under-represented minority groups are especially likely to be self-taught and/or dropouts. Even practicing software engineers with formal CS training tend to forget the school algorithm courses that they rarely or never use. Thus these coding interviews tend to act like Koretz’s vocabulary test populated with words like silliculose, vilipend, and epimysium (not that bad, but the point remains valid).
With respect to Koretz’s “score inflation,” a coding interview is clearly a high-stakes test, with a job, often a high paying desirable job, at stake. Not surprisingly, there are extensive efforts to “game” the coding interviews. Amazon lists dozens of books on how to ace a coding interview, including the market leader, Gayle Laakmann McDowell’s Cracking the Coding Interview: 150 Programming Questions and Solutions (5th Edition). The author, a former engineer at Google and other big name tech companies, has her own company, CareerCup, with several books, videos and other resources for interviewing for jobs at high tech companies. There is even a popular meetup group in Silicon Valley for practicing coding interviews: https://www.meetup.com/Coding-Interview-Practice/
The college level CS algorithm coding interviews also exhibit the confusion between competency and ranking common in the highly competitive computer industry. The main reason for testing basic skills taught in school is to confirm basic competency. As it happens, in this case, these basic skills taught in school are not necessary to program (and program well) in the real world since real world programmers reuse implementations of the algorithms in common libraries and programming languages. In practice, the companies compare candidates based on how well they do on these unrepresentative tests — for ranking rather than competence evaluation. This is highly unlikely to produce desirable results unless the company limits itself to recent computer science graduates and maybe not even then. The basic CS algorithm coding interviews are neither good competency nor good ranking tests.
Read Measuring Up but take the book with a grain of salt. So too, use standardized tests but invest in careful design of the tests and exercise caution in using the results of the tests, combining them with other information and criteria as Professor Koretz recommends.
© 2015 John F. McGowan
About the Author
John F. McGowan, Ph.D. solves problems using mathematics and mathematical software, including developing gesture recognition for touch devices, video compression and speech recognition technologies. He has extensive experience developing software in C, C++, MATLAB, Python, Visual Basic and many other programming languages. He has been a Visiting Scholar at HP Labs developing computer vision algorithms and software for mobile devices. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech).