
This began as a comment in the diary Teacher Ken posted re: a number of well-regarded educators coming to the support of teachers in WA state who are boycotting their district's mandatory standardized test (it's worth a read - go look).

That particular test is especially egregious, because while it is supposed to be designed exclusively for the district, and based on the district's curriculum, it is not. The test seems almost specifically designed to force the district to fail. Of course that's conjecture on my part, but it would certainly fit well with the goals of the people who most strongly push the testing mantra: those who want to see more for-profit schools funded by taxpayer dollars at the expense of public schools. Since the test was foisted upon the district by a former administrator who had been fired for incompetence, well, it's not really far-fetched to expect that revenge against the district could have played a small role in choosing the test.

But the MAP test isn't the only high-stakes test out there. There are plenty of them clogging up school systems throughout the country. These tests are highly profitable, which is part of the reason the Bush Administration put so much effort into making them a national requirement. After all, the president's own brother owns Ignite, a company that provides standardized test preparation services to students.

One of the arguments encouraging us to let the Bush brother connection slide at the time was along the lines of, "Eh, what's a little conflict of interest, when our children's futures are at stake?"

Well, our children's futures were at stake, not because we weren't testing them with these magical, highly-profitable, future-predicting tests, but because we were about to launch the greatest offensive against public education the nation had ever seen - by implementing these tests.

Follow me below the orange croissant for some meaty goodness ....

It is well-documented that the test results are falsified and/or entirely useless, both at the local level (watch for the annual stories about the teacher who is fired when he/she has been caught replacing student answers in order to improve scores) and via the testing companies themselves.

While a brief, honest test of basics could give teachers a handle on areas where individual students could use some additional tutoring, these day-long or multi-day tests containing biased questions and graded dishonestly do nothing of any use to society. They are simply siphons to suck public funding into private pockets.

Here's an excerpt from an essay awarded the highest score by the Educational Testing Service's grader:


    In today's society, college is ambiguous. We need it to live, but we also need it to love. Moreover, without college most of the world's learning would be egregious. College, however, has myriad costs. One of the most important issues facing the world is how to reduce college costs. Some have argued that college costs are due to the luxuries students now expect. Others have argued that the costs are a result of athletics. In reality, high college costs are the result of excessive pay for teaching assistants.

    I live in a luxury dorm. In reality, it costs no more than rat infested rooms at a Motel Six. The best minds of my generation were destroyed by madness, starving hysterical naked, and publishing obscene odes on the windows of the skull. Luxury dorms pay for themselves because they generate thousand and thousands of dollars of revenue. In the Middle Ages, the University of Paris grew because it provided comfortable accommodations for each of its students, large rooms with servants and legs of mutton. Although they are expensive, these rooms are necessary to learning. The second reason for the five-paragraph theme is that it makes you focus on a single topic. Some people start writing on the usual topic, like TV commercials, and they wind up all over the place, talking about where TV came from or capitalism or health foods or whatever. But with only five paragraphs and one topic you're not tempted to get beyond your original idea, like commercials are a good source of information about products. You give your three examples, and zap! you're done. This is another way the five-paragraph theme keeps you from thinking too much.

This was written by an MIT writing professor, specifically to test the grading system that is providing make-or-break scores on our children's futures.

That horrifying essay's score was provided by a scoring robot - an algorithm that uses some basic rules to determine how "good" the essay is. But not all tests are scored by robots - surely human scorers do better, right? Nope:

   The study, funded by the William and Flora Hewlett Foundation, compared the software-generated ratings given to more than 22,000 short essays, written by students in junior high schools and high school sophomores, to the ratings given to the same essays by trained human readers.

    The differences, across a number of different brands of automated essay scoring software (AES) and essay types, were minute. “The results demonstrated that over all, automated essay scoring was capable of producing scores similar to human scores for extended-response writing items,” the Akron researchers write, “with equal performance for both source-based and traditional writing genre.”

The essay is make-or-break in admissions decisions at a number of high-level schools. And it's easy to game, at least if you know the rules. If you know any college-bound high school students, send them this. They can thank you later. Note: the math section can also be gamed, just by learning what tricks they use to try to cause students to give the wrong answer, whether or not the student actually knows how to do the math. But that section is scored entirely by bots. The essay is where human graders have come clean about the cheating.
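To make concrete what a surface-rubric scorer might look like, here is a minimal, entirely hypothetical sketch. The features below (word count, transition words, quotation marks) are illustrative guesses at the kind of rules described above, not any vendor's actual algorithm:

```python
# Hypothetical rule-based essay scorer. The heuristics below (length,
# transition words, quotation marks) are illustrative only; they are not
# taken from any real scoring engine.

TRANSITIONS = {"moreover", "however", "therefore", "furthermore", "consequently"}

def score_essay(text: str) -> int:
    """Return a 1-6 score from surface features only; meaning is never checked."""
    words = text.lower().split()
    score = 1
    # Length dominates: roughly one point per hundred words, capped.
    score += min(3, len(words) // 100)
    # Reward "academic" transition words.
    if any(w.strip('.,;:"') in TRANSITIONS for w in words):
        score += 1
    # Reward anything that looks like a quotation.
    if '"' in text:
        score += 1
    return min(score, 6)
```

Note what a rubric like this implies: a factually absurd 400-word essay peppered with "moreover" outscores a correct, concise 100-word one, because nothing in the rules ever examines what the essay actually says.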

But, but, but these tests tell us how people will perform in real life, right? They help us sort the winners from the losers, allowing the winners greater opportunity to make something of themselves in the world, while preventing the losers from wasting their time and money on expensive college educations that won't do them any good. (Or so the mantra goes). But, um, no:

   Roach took the FCAT himself last year and failed dismally, getting wrong 84 percent of the math questions and only scoring 62 percent on the writing portion, which would get him a “mandatory assignment to a double block of reading instruction,” according to Roach.

    “It seems to me something is seriously wrong. I have a bachelor of science degree, two masters’ degrees, and 15 credit hours toward a doctorate. I help oversee an organization with 22,000 employees and a $3 billion operations and capital budget, and am able to make sense of complex data related to those responsibilities.

    “It might be argued that I’ve been out of school too long, that if I’d actually been in the 10th grade prior to taking the test, the material would have been fresh. But doesn’t that miss the point? A test that can determine a student’s future life chances should surely relate in some practical way to the requirements of life. I can’t see how that could possibly be true of the test I took,” wrote Roach.

But wait, there's more! The numbers assigned are, by and large, faked to create an appearance of consistency from year to year within a school district. It doesn't matter if students actually do better, because the scores are not allowed to deviate from the statistical norm for the school system:

   What is the work itself like? In test-scoring centers, dozens of scorers sit in rows, staring at computer screens where students’ papers appear (after the papers have undergone some mysterious scanning process). I imagine that most students think their papers are being graded as if they are the most important thing in the world. Yet every day, each scorer is expected to read hundreds of papers. So for all the months of preparation and the dozens of hours of class time spent writing practice essays, a student’s writing probably will be processed and scored in about a minute.


    There is a common fantasy that test scorers have some control over the grades they are giving. ...[snip]...  Usually, within a day or two, when the scores we are giving are inevitably too low (as we attempt to follow the standards laid out in training), we are told to start giving higher scores, or, in the enigmatic language of scoring directors, to “learn to see more papers as a 4.” For some mysterious reason, unbeknownst to test scorers, the scores we are giving are supposed to closely match those given in previous years. So if 40 percent of papers received 3s the previous year (on a scale of 1 to 6), then a similar percentage should receive 3s this year. Lest you think this is an isolated experience, Farley cites similar stories from his fourteen-year test-scoring career in his book, reporting instances where project managers announced that scoring would have to be changed because “our numbers don’t match up with what the psychometricians [the stats people] predicted.” Farley reports the disbelief of one employee that the stats people “know what the scores will be without reading the essays.”

    I also question how these scores can possibly measure whether students or schools are improving. Are we just trying to match the scores from last year, or are we part of an elaborate game of “juking the stats,” as it’s called on HBO’s The Wire, when agents alter statistics to please superiors? For these companies, the ultimate goal is to present acceptable numbers to the state education departments as quickly as possible, beating their deadlines (there are, we are told, $1 million fines if they miss a deadline). Proving their reliability so they will continue to get more contracts.

Why do they "juke the stats"? Because if the stats aren't consistent from year to year, states become less willing to pay for the same test the next time: they conclude the test itself is bad, rather than accept that different groups of students can simply earn different scores. Thus, to ensure continued profitability, the testing companies must cheat.

The goals of these tests are entirely unrelated to education - at least on the creation/grading end. It doesn't matter what the state or local school system thinks they're trying to accomplish with the testing. Parents and school systems have bought a testing pig in a poke, and if they ever actually open the bag, they'll discover the pig is actually "Blinky" the three-eyed fish from the Simpsons.

Originally posted to Radical Simplicity on Tue Jan 22, 2013 at 10:57 AM PST.

Also republished by Education Alternatives.


  •  absurd illogic (0+ / 0-)

    You take one example of a paper specifically designed to fool the scoring engine. Then you say that because, in a large well-designed study, scoring engines matched human scores, that means human scores are flawed.

    You are wrong.

    Just because a professor can deliberately set out to fool the scoring engine does not mean that all students do it or could do it.  Well designed studies show that scoring engines agree with human scores as much or more than humans agree with humans.

    Human grading is well designed and well implemented.  Large scale essay grading is more objective and consistent than grading by a single person.

    •  Actually, no (1+ / 0-)
      Recommended by:
      Linda Wood

      I posted links to everything in the diary, including the story that details that the professor's students were able to write an app to create nonsense that fools the bots every time.

      The study regarding scores by humans was undertaken to prove that the bots do as well as the humans, who are taught to grade by the same rules that were used to create the bots - it was a validation of the bots' accuracy.

      The claim that human grading is well designed and well implemented is belied by the fact that the bots give the same scores that humans give. If you then read the articles by actual human graders, you will find that they are given specific rules to use for grading, which do not take quality into account, and are given approximately 60 seconds per essay to issue a score on that essay. To claim that a reader is going to read 2 pages of hand-written text, vet it for clarity and accuracy, and score it based on anything other than the surface criteria provided (use of certain words, including quotes from famous sources, etc.) is sadly misinformed about the way the testing system works in real life.

      The diary is an overview of the failure modes.  I provided links to back up everything in this diary. You may find the content at the other ends of the links to be enlightening. An exhaustive critique would require a book, not a diary. Here's one: Making the Grades, written by one of the graders, who advanced through the testing industry to see the way it works, from top to bottom.

      The system they use today is not the system that was in place 30 or 40 years ago, which was used to determine the best classes to which to assign students in the upcoming grade. Those were useful. The new tests are not.

      Do you know what the number 1 predictor of your essay score is? Length. The longer the essay, the better the grade. Period.

      In the next weeks, Dr. Perelman studied every graded sample SAT essay that the College Board made public. He looked at the 15 samples in the ScoreWrite book that the College Board distributed to high schools nationwide to prepare students for the new writing section. He reviewed the 23 graded essays on the College Board Web site meant as a guide for students and the 16 writing "anchor" samples the College Board used to train graders to properly mark essays.

      He was stunned by how complete the correlation was between length and score. "I have never found a quantifiable predictor in 25 years of grading that was anywhere near as strong as this one," he said. "If you just graded them based on length without ever reading them, you'd be right over 90 percent of the time." The shortest essays, typically 100 words, got the lowest grade of one. The longest, about 400 words, got the top grade of six. In between, there was virtually a direct match between length and grade.

      He was also struck by all the factual errors in even the top essays.


      Dr. Perelman contacted the College Board and was surprised to learn that on the new SAT essay, students are not penalized for incorrect facts. The official guide for scorers explains: "Writers may make errors in facts or information that do not affect the quality of their essays. For example, a writer may state 'The American Revolution began in 1842' or ' "Anna Karenina," a play by the French author Joseph Conrad, was a very upbeat literary work.' " (Actually, that's 1775; a novel by the Russian Leo Tolstoy; and poor Anna hurls herself under a train.) No matter. "You are scoring the writing, and not the correctness of facts."
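      The near-perfect length-to-score relationship described above can be illustrated with a quick correlation check. The word counts and scores below are made-up toy data mimicking the pattern Dr. Perelman reports, not actual College Board samples:

```python
# Toy (word count, score) pairs; illustrative only, not real SAT data.
lengths = [100, 150, 220, 280, 340, 400]
scores = [1, 2, 3, 4, 5, 6]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print(pearson(lengths, scores))  # very close to 1.0 on this toy data
```

      On data like this, predicting the score from length alone is right essentially every time - which is exactly the pattern Perelman found in the real graded samples.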

      •  correlation is not causation (0+ / 0-)

        Length of essay responses is correlated to quality factors.  Short essays are typically not very good because they are not well developed.

        As to scoring criteria, those are clearly stated upfront.  That is why large scale scoring is more reliable than individual scoring.

        Do you know about how human scorers are constantly assessed each day and throughout the day to ensure that they are following the scoring guide? Do you know they must pass calibration assessments at least once a day? Do you know they typically get blind "check" papers throughout the day to confirm that they are scoring correctly?

        •  Clearly, you're not reading the linked content (0+ / 0-)

          Yes, actually, I do know about the assessments, the entire process is well-described by the graders who wrote the articles and books to which I linked.

          The assessment does nothing to ensure that essays are graded in a way that has anything to do with quality, but rather ensures that they are scored according to the rules on which the graders are trained, and on which the bots are based, and which have exactly nothing to do with the student's ability to write well and accurately. Those rules let essays such as the one Dr. Perelman wrote (and other nonsense essays) get excellent scores, while high quality, but succinct essays fail.  

          For some reason, I get the feeling you're employed by one of these companies.

          •  you assume that individual (0+ / 0-)

            graders understand how the entire system works.

            You assume that the "rules" are meaningless.

            Your assumption about me is wrong as well.

            •  Showing once again that you have not read (0+ / 0-)

              ... the source material.

              One of the graders was promoted through to management (the one who wrote the book). He does know how the system works from the top down. In addition, it's very clear from the kinds of essays that get good marks, that the grading system rules are meaningless in terms of determining whether or not a student knows how to write a quality, meaningful, concise essay. A brilliant, accurate, but concise essay will lose points due to brevity.

              I have provided data to back up the information in this diary. You have chosen not to review it. In addition, you have chosen not to provide any data to back up your assertions. If you have data to back up your claims, you are free to cite the sources. Otherwise, it is pointless to continue this thread, since you are choosing not to participate in a dialog based on the data, but rather to provide an apologia.

  •  Thank you for the Bush references. (0+ / 0-)

    Thank you for connecting the most corrupt influences in our country to this reform debate. I write this as a person who believes reform has been called for in public education for decades, especially at the primary level, but who also believes the opportunity for corruption of the efforts to reform is always present, especially whenever anyone associated with the Bush people is involved.

    What I struggle with in the debate here at Daily Kos is the over-simplification of the subjects we're debating, so that "standardized testing" is condemned, when most of us don't really know what is meant, at any given moment, by that term.

    Therefore I also really appreciate your giving concrete examples of some of the horrors possible in the testing business. I find both the robotized process and the human process of evaluation, as you've shown them here, to be shocking, nuts, and worth talking about more at Daily Kos.
