The "jagged frontier" paper: how not to evaluate machine learning

tl;dr: the “jagged technological frontier” paper doesn’t show that AI improves the quality of people’s work. The grading rubric has serious issues, and the numbers are averaged so many times that it’s no longer clear what they mean.

There’s a working paper making the rounds called Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. One of the reasons it’s so popular is that it claims to quantify how much workers’ “productivity and quality” improve when they’re able to use GPT-4. Those results are starting to show up elsewhere in AI discourse, such as Google Cloud’s recent post entitled Prompt engineering and you: How to prepare for the work of the future.

I’ve read the paper a few times, and I’m still not convinced. It’s a fairly long paper, and there’s a lot to unpack. But for now, I’m just going to focus on evaluation.

The tasks

The paper includes many different tasks, which makes things a little bit complicated. So I’m going to focus on the tasks which are getting the most hype. Participants did the following writing tasks, which I’m quoting from Appendix A of the paper:


  1. Generate ideas for a new shoe aimed at a specific market or sport that is underserved. Be creative, and give at least 10 ideas.
  2. Pick the best idea, and explain why, so that your boss and other managers can understand your thinking.
  3. Describe a potential prototype shoe in vivid detail in one paragraph (3-4 sentences).
  4. Come up with a list of steps needed to launch the product. Be concise but comprehensive.
  5. Come up with a name for the product: consider at least 4 names, write them down, and explain the one you picked.
  6. Use your best knowledge to segment the footwear industry market by users. Keep it general, and do not focus yet on your specific target and customer groups.
  7. List the initial segments might you consider (do not consider more than 3).
  8. List the presumed needs of each of these segment. Explain your assessment.
  9. Decide which segment is most important. Explain your assessment.
  10. Come up with a marketing slogan for each of the segments you are targeting.
  11. Suggest three ways of testing whether your marketing slogan works well with the customers you have identified.
  12. Write a 500-word memo to your boss explaining your findings.
  13. Your boss would like to test the idea with a focus group. Please, describe who you would bring into this focus group.
  14. Suggest 5 questions you would ask the people in the focus group.
  15. List (potential) competitor shoe companies in this space.
  16. Explain the reasons your product would win this competition in an inspirational memo to employees.
  17. Write marketing copy for a press release.
  18. Please, synthesize the insights you have gained from the previous questions and create an outline for a Harvard Business Review-style article of approximately 2,500 words. In this article, your goal should be to describe your process end-to-end so that it serves as a guide for practitioners in the footwear industry looking to develop a new shoe. Specifically, in this article, please describe your process for developing the new product, from initial brainstorming to final selection, prototyping, market segmentation, and marketing strategies. Please also include headings, subheadings, and a clear structure for your article, which will guide the reader through your product development journey and emphasize the key takeaways from your experience. Please also share lessons learned and best practices for product development in the footwear industry so that your article serves as a valuable resource for professionals in this field.

Source: list of tasks from Appendix A of the “jagged technological frontier” paper


Some participants did those tasks without GPT, some with GPT, and some with GPT plus some instruction on how to use GPT. The idea is to see whether the participants who used GPT did better on the task.

The grades don’t make sense

Here’s the grading rubric, again quoted from Appendix A of the paper. I’m reformatting it a little bit. In the paper, the rubric is a low-resolution screenshot, which is difficult to read. So I’ve copied the text into a table.

The rubric included some examples, but I’m leaving them out here. It’s not clear what task the examples come from, and they aren’t explained at all. So they make the rubric more confusing.


Score Description
1
  • Participant does not identify tactical actions for client to boost profits
2-4
  • Participant alludes to recommendations on how to boost profit but does not explicitly call out tactical actions
  • Description of actions lacks specificity. Little to no description of business reasoning and impact on profit
5
  • Participant identifies tactical actions for client to boost profits
  • Tactical actions are aligned with overall channel strategy
  • Participant does not describe how to implement strategy. Explanation lacks specificity
  • Business reasoning unclear
6-8
  • Participant identifies tactical actions for client to boost profits
  • Tactical actions are aligned with overall channel strategy
  • Participant describes tactical actions in detail and outlines how to implement
  • Business reasoning lacks clarity. Participant does not fully connect explanation to client concerns. Impact on profit unclear.
9-10
  • Participant identifies tactical actions for client to boost profits
  • Tactical actions are aligned with overall channel strategy
  • Participant includes elaborate description of how to implement tactical actions
  • Actions backed up by sound business reasoning. Participant draws on historical pain points and belief audits
  • Participant outlines impact of action on profit
  • Points deducted if explanation is incomplete

Source: grading rubric from Appendix A of the “jagged technological frontier” paper


The first thing I see is that there’s a mismatch between the rubric and many of the tasks. It’s not clear how a grader can apply this rubric to a task like “Write marketing copy for a press release.” 

But I think the most serious issue is the grades themselves. It’s confusing that sometimes a row includes a single number, such as “1” or “5”, and other times it includes a range, such as “6-8” or “9-10.” If a response falls somewhere in the “2-4” range, how did graders decide to assign 2, 3, or 4 points? 

The other problem is that the numbers themselves don’t really mean anything. In this grading scheme, there’s a sense that higher numbers indicate “better” responses, but it’s not clear how much “better” 2 is than 1, or how much “better” 8 is than 7, and so on. Is a response that gets a score of 4 really “twice as good” as a response that gets a score of 2? If so, what does it mean to be “twice as good”?

In fact, we could just as easily replace the numbers with other descriptors, such as in the following table:


Score Description
Poor
  • Participant does not identify tactical actions for client to boost profits
Fair
  • Participant alludes to recommendations on how to boost profit but does not explicitly call out tactical actions
  • Description of actions lacks specificity. Little to no description of business reasoning and impact on profit
Good
  • Participant identifies tactical actions for client to boost profits
  • Tactical actions are aligned with overall channel strategy
  • Participant does not describe how to implement strategy. Explanation lacks specificity
  • Business reasoning unclear
Excellent
  • Participant identifies tactical actions for client to boost profits
  • Tactical actions are aligned with overall channel strategy
  • Participant describes tactical actions in detail and outlines how to implement
  • Business reasoning lacks clarity. Participant does not fully connect explanation to client concerns. Impact on profit unclear.
Outstanding
  • Participant identifies tactical actions for client to boost profits
  • Tactical actions are aligned with overall channel strategy
  • Participant includes elaborate description of how to implement tactical actions
  • Actions backed up by sound business reasoning. Participant draws on historical pain points and belief audits
  • Participant outlines impact of action on profit
  • Points deducted if explanation is incomplete

Adapted from the grading rubric in Appendix A of the “jagged technological frontier” paper


That suggests that the numbers themselves don’t matter – they’re just codes that we’re using for convenience. That becomes a problem when we try to do “mathy” things to the numbers, as we’ll see below.
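
To make that concrete, here is a minimal Python sketch. The two groups of scores are made up for illustration and do not come from the paper, and `band` is a hypothetical helper that maps each 1-10 score to the position of its rubric row (1 for Poor through 5 for Outstanding). Both scales keep responses in the same order, yet they disagree about which group has the higher mean.

```python
from statistics import mean

def band(score):
    """Map a 1-10 rubric score to its row position (1 = Poor ... 5 = Outstanding)."""
    if score == 1:
        return 1
    if 2 <= score <= 4:
        return 2
    if score == 5:
        return 3
    if 6 <= score <= 8:
        return 4
    if 9 <= score <= 10:
        return 5
    raise ValueError(f"score out of range: {score}")

# Hypothetical scores for two groups of responses (not data from the paper).
group_a = [5, 5, 5]
group_b = [4, 4, 8]

for label, scores in [("A", group_a), ("B", group_b)]:
    banded = [band(s) for s in scores]
    # Print the mean on the 1-10 codes, then the mean on the five row labels.
    print(label, round(mean(scores), 2), round(mean(banded), 2))
```

On the 1-10 codes, group B has the higher mean. On the five row labels, group A does. Nothing about the responses changed, only the codes we chose for them.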

We don’t know much about the graders, but I would be very curious to see how often graders agreed on the scores they assigned. Typically, for projects like this, researchers need to have several sessions with graders, teaching them how the grading works, updating the rubric based on grader feedback, and figuring out what to do when graders disagree. That kind of work is a major field of study in its own right. But we don’t get any details about that in the paper.
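
One common way to report this is an agreement statistic such as Cohen’s kappa, which compares how often two graders agree with how often they would agree by chance. The paper does not report one, so the sketch below is only an illustration. The function name and the grades are hypothetical, and it uses the plain unweighted form, which treats a 4-versus-5 disagreement the same as a 1-versus-10 disagreement.

```python
from collections import Counter

def cohens_kappa(grades_a, grades_b):
    """Unweighted Cohen's kappa for two graders scoring the same responses."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two non-empty grade lists of equal length")
    n = len(grades_a)
    # Share of responses where the two graders gave the same grade.
    observed = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    # Agreement expected by chance, from each grader's own grade frequencies.
    counts_a = Counter(grades_a)
    counts_b = Counter(grades_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    # Undefined (division by zero) if both graders always give one identical grade.
    return (observed - expected) / (1 - expected)

# Hypothetical grades from two graders for eight responses (not from the paper).
grader_1 = [5, 5, 7, 7, 9, 2, 5, 7]
grader_2 = [5, 6, 7, 8, 9, 2, 5, 5]
print(round(cohens_kappa(grader_1, grader_2), 2))
```

A value of 1 means perfect agreement, and a value near 0 means the graders agree about as often as chance would predict.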

The averages don’t make sense

So overall it seems like the numbers are not “really” numbers, but just codes for something. That’s a problem when you try to do “mathy” things to the numbers, such as taking averages. If you have a number like 4.5 or 5.67, what does that mean? How much better is a score of 4.6 compared to 4.5? What does that mean in terms of the rubric? It’s not clear.

Here’s a similar example. Let’s say I ask 100 of my friends how much they like ice cream on a scale of 1 to 5, where 1 means “I hate ice cream” and 5 means “I love ice cream.” If I say that my friends’ average feeling about ice cream is 3.75, what does that mean? Do I have some friends who love ice cream and some who hate it? Or do most of my friends just feel kind of “meh” about ice cream? It would make more sense if I just showed you the response counts, and maybe made some bar charts.
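
Here is a small Python sketch of that ambiguity. Both response lists are invented for illustration: one describes a polarized group of friends and the other a lukewarm one, and both average exactly 3.75.

```python
from collections import Counter
from statistics import mean

# Two hypothetical sets of 100 responses on the 1-5 scale.
polarized = [1] * 30 + [4] * 5 + [5] * 65   # many haters, many lovers
lukewarm = [3] * 25 + [4] * 75              # nobody feels strongly

for name, responses in [("polarized", polarized), ("lukewarm", lukewarm)]:
    counts = sorted(Counter(responses).items())
    # The mean is identical; only the counts reveal the difference.
    print(name, mean(responses), counts)
```

The response counts tell the two groups apart immediately, while the average hides the difference entirely.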

The problem is that the authors of this paper did take a bunch of averages. Specifically:

  • 758 participants worked through the 18 tasks, possibly not finishing some of them.
  • Each of the completed tasks was graded by two graders. 
  • The authors took the mean of those two grades.
  • The authors then took those mean grades, and averaged them across all the questions, to produce the composite “Quality” score.
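
The steps above can be sketched in a few lines of Python. This is my reading of the procedure, not the authors’ code, and the grades are invented and use only four tasks to keep the example short. The `composite` function is a hypothetical name for the two rounds of averaging.

```python
from statistics import mean

def composite(grades):
    """Average each task's two grades, then average across the completed tasks.

    grades: list of (grader_1, grader_2) tuples, one per completed task.
    """
    per_task = [mean(pair) for pair in grades]
    return mean(per_task)

# Three hypothetical participants (not data from the paper).
steady = [(5, 5), (5, 5), (5, 5), (5, 5)]     # consistently middling
uneven = [(1, 1), (9, 9), (1, 1), (9, 9)]     # terrible on half, great on half
disputed = [(1, 9), (2, 8), (3, 7), (4, 6)]   # graders never agree

for name, grades in [("steady", steady), ("uneven", uneven), ("disputed", disputed)]:
    print(name, composite(grades))
```

All three participants end up with the same composite score, even though their work and their graders’ opinions look nothing alike.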

That’s a lot of numbers getting arbitrarily smooshed together. In the end, it’s not clear what those composite scores actually mean, or indeed if they mean anything. This is a serious problem, because those numbers are getting widely repeated in AI circles to support claims such as “AI can help low-performing workers do better work.” Which is a problematic claim in lots of other ways, but that’s a topic for another post. For now it’s enough to say that even if we assume that’s a reasonable claim to investigate, the numbers from this experiment don’t prove that.

By the way, I’m intentionally not including the results here, because of something a friend and colleague told me. We often condense complicated results to simple numbers and easy-to-read charts because it makes the information more memorable and easier to understand. And that’s not necessarily a bad thing. But as my friend says, even when the numbers and charts are wrong or misleading, people are often still going to remember them and keep talking about them. That can cause a lot of harm.

An alternative: qualitative coding

One way to overcome these problems would be for the authors not to worry about producing numbers at all. Instead of assigning a grade to each response, graders could code it. In other words, take the themes in the rubric and annotate the responses, for example by highlighting the passages where each theme appears. We can see a few themes in the grading rubric:

  • Boosting profits
  • General recommendations
  • Tactical actions
  • Alignment to channel strategy
  • Implementation details
  • Business reasoning
  • Connection to client concerns
  • Connection to profit impact
  • Connection to historical pain points
  • Connection to belief audits

So instead of giving each answer a numerical grade, the graders could highlight and label examples of “boosting profits,” “tactical actions,” and so on. I think this would have made the analysis much more interesting. It would also give the flexibility to discover new themes. With the way the rubric is set up now, if a participant wrote something interesting that’s not covered in the rubric, there’s no way for graders to identify that.
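
As a rough sketch of what coded data could look like, here is a small Python example. Everything in it is hypothetical: the three responses, the condition labels, and the codes attached to them are invented, and `theme_counts` is just one simple way to tally codes per condition.

```python
from collections import Counter, defaultdict

# Hypothetical coded responses: each one carries labels instead of a grade.
coded_responses = [
    {"condition": "No GPT", "themes": {"Tactical actions", "Business reasoning"}},
    {"condition": "GPT", "themes": {"Tactical actions", "Implementation details"}},
    {"condition": "GPT", "themes": {"Tactical actions", "New theme (not in rubric)"}},
]

def theme_counts(responses):
    """Count how many responses in each condition were labeled with each theme."""
    counts = defaultdict(Counter)
    for response in responses:
        counts[response["condition"]].update(response["themes"])
    return counts

for condition, counts in theme_counts(coded_responses).items():
    print(condition, sorted(counts.items()))
```

The output is a set of counts per theme rather than a single score, and a grader can add a label for something the rubric never anticipated.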

I can think of a few reasons why the authors might not have wanted to do this. It’s more time consuming, and sharing the full text of participants’ responses means researchers have to put more data privacy protections in place. Even so, that level of detail is crucial, both for the results to be convincing and for other researchers who want to reproduce the experiment and see if they get the same results.

Conclusion

The results of this paper don’t prove that using GPT-4 helps workers do better work. The grading system is not convincing to me at all. To be fair, it is *really difficult* to assign a number to the quality of people’s work. That’s not usually how we assess people’s work in “the real world.” Nobody agrees about the best way to do that. So when a paper comes out claiming to do that, we should approach that critically. Especially if the result seems to align with industry hype.

And look, I absolutely understand and respect that the authors of the paper are professional researchers who know what they’re doing. I’m pretty sure the authors will have thought about and discussed these issues. I just wish there was more discussion of why they chose to do the grading this way, and what kinds of quality controls they had in place. Maybe we’ll learn more as researchers discuss and argue about it.