What are valid outcome measures in education research?

May 29, 2025

I wanted to follow up on the article I published two weeks ago, Is ChatGPT Really Ruining Education? There has been some controversy surrounding some of the studies mentioned in that piece.

In particular, Ben Riley wrote Something rotten in AI research, and poked specifically at two of the papers I mentioned.

Nature study

First, we went into great detail in that podcast episode about the Nature meta-analysis that came out earlier this month. The authors reviewed 51 studies and concluded that ChatGPT had a significant positive effect on student learning. As Ben points out:

These are stunning results for an education intervention—trust me when I say an effect size of .867 on student learning is massive.

However, he then quotes an email from Steve Ritter, the founder of Carnegie Learning, who calls these results into question, mainly because the authors misclassified some of the studies they analyzed (they listed one as addressing secondary students when in fact it was 3rd grade teachers who used ChatGPT). I agree that the studies were a mixed bag. However, Ritter then concludes, “Given that one paper I looked at is so clearly misclassified, I’m not sure I trust anything in the meta-analysis.” I wouldn’t go that far: the papers they analyzed varied in quality, but there is still some evidence of a huge benefit from using chatbots.
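To put that 0.867 in context, here is a rough sketch of how a pooled effect size like that is typically computed in a meta-analysis: each study contributes a standardized mean difference (Hedges’ g), and the studies are combined with inverse-variance weights. The numbers below are invented for illustration and are not taken from the Nature paper.

```python
import numpy as np

# Hypothetical per-study summaries (NOT from the Nature meta-analysis):
# (mean_treatment, mean_control, pooled_sd, n_treatment, n_control)
studies = [
    (82.0, 74.0, 10.0, 40, 40),
    (71.0, 66.0, 12.0, 55, 50),
    (88.0, 79.0, 11.0, 30, 35),
]

def hedges_g(mt, mc, sd, nt, nc):
    """Standardized mean difference with Hedges' small-sample correction."""
    d = (mt - mc) / sd
    correction = 1 - 3 / (4 * (nt + nc) - 9)
    return d * correction

def g_variance(g, nt, nc):
    """Approximate sampling variance of Hedges' g."""
    return (nt + nc) / (nt * nc) + g**2 / (2 * (nt + nc))

gs = np.array([hedges_g(*s) for s in studies])
vs = np.array([g_variance(g, s[3], s[4]) for g, s in zip(gs, studies)])

# Inverse-variance (fixed-effect) pooling; a random-effects model would add
# a between-study variance term, tau^2, to each study's variance first.
weights = 1 / vs
pooled = np.sum(weights * gs) / np.sum(weights)
print(f"pooled effect size: {pooled:.3f}")
```

The point of spelling this out is that the pooled number is only as good as the per-study inputs, which is exactly why misclassified studies are worth worrying about.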

And we’re just getting started.

Tutor CoPilot

Ben points out a more serious issue, though, with the Stanford Tutor CoPilot study. You can see a nice description of its impact by the authors here.

Ben says that the authors originally pre-registered one outcome measure, but then swapped in another when they went to analyze the results.

What I am prepared to say is that the authors of this study presented their findings in a way that deliberately plays up the positive impact of AI, while minimizing the evidence that calls into question the value of using AI to enhance tutoring. And I find this troubling. To understand why requires diving a bit into the research weeds, so please bear with me.

He does go into the weeds, concluding that the researchers had pre-registered NWEA MAP scores as their primary measure, but then used exit tickets instead in a switcheroo.

In my view, to later feature exit-ticket data because it provides modest evidence of a positive AI effect contravenes the spirit if not the letter of the pre-registration process, which after all is to prevent data hacking to produce novel results.

I found this pretty alarming, so I looked at the pre-registration, available here. He is technically correct. But when I look at the actual document, it appears the authors threw in both NWEA MAP scores and exit tickets without giving much thought to exactly how they would be analyzed. In other studies, by contrast, authors specify the exact outcome measures they plan to use, sometimes even the scripts they will run to analyze them. In this case, the authors were mainly focused on how students felt about using the AI tutor and what language they used; the learning gains were something of an afterthought.
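To see why Ben cares so much about this, here is a toy simulation, not tied to the Tutor CoPilot data in any way, of what outcome switching does when the intervention has no effect at all: if you measure several outcomes and report whichever looks best, the chance of finding at least one “significant” result climbs well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy simulation (not the Tutor CoPilot data): the "intervention" has zero
# true effect, but each simulated study measures several outcomes (say, MAP
# scores, exit tickets, grades, attendance) and reports whichever looks best.
n_sims = 2000
n_students = 200   # per arm
n_outcomes = 4

false_positives = 0
for _ in range(n_sims):
    pvals = []
    for _ in range(n_outcomes):
        treatment = rng.normal(0.0, 1.0, n_students)  # no real effect
        control = rng.normal(0.0, 1.0, n_students)
        pvals.append(stats.ttest_ind(treatment, control, equal_var=False).pvalue)
    if min(pvals) < 0.05:  # cherry-pick the most favorable outcome
        false_positives += 1

print(f"false-positive rate with outcome switching: {false_positives / n_sims:.1%}")
# Roughly 1 - 0.95**4, about 19%, versus the 5% you'd get by sticking to one
# pre-registered primary outcome. Real outcomes are correlated, which dampens
# but does not eliminate this inflation.
```

This is the data hacking Ben is referring to: committing to a single primary outcome in advance is what keeps that 5% meaning 5%.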

I don’t think this disqualifies the study, but it does underscore why we need additional studies with more rigorous pre-registration and analysis in the future.