Touching Grass
Why debates about AI progress and automation need to get off social media and into the weeds
In Experience and Nature (1925), John Dewey describes the “philosopher’s fallacy,” whereby the refined products of intellectual inquiry are “converted into antecedent existences.” The problem isn’t with abstraction itself, but with taking what comes out of a process of abstraction and treating it as what was there before. Put another way: don’t mistake the map for the territory.
Much of the recent discourse about AI progress and automation makes this mistake.
The Real Story
Benchmarks are some of the best tools we have for mapping AI progress. They provide a window into the kinds of tasks that are just-out-of-reach for frontier AIs, thereby giving them a hill to climb. An underappreciated corollary is that there are also many tasks that are so far out-of-reach that no one has bothered designing a benchmark. We tend to focus on the former and not the latter. Is anyone talking about how well Gemini 4 will score on ECEval, a benchmark designed to measure AI progress toward a drop-in pre-school teacher? Of course not. Early Childhood Educator tasks are so far out-of-reach that there’d be no training signal.
Instead, the news is that Claude Opus 4.6 beats GPT-5.2 by 144 Elo points on GDPval-AA. To be clear, this is extremely impressive. But the real benchmark story is that a decade ago, the tasks that were just-out-of-reach were identifying cats in the ImageNet database. Today, the tasks that are just-out-of-reach look an awful lot like running a business.
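For intuition about what a gap like that means, assuming GDPval-AA follows the standard 400-point logistic Elo convention familiar from chess (the function below is my own sketch, not anything from the benchmark), an Elo gap converts to a head-to-head win probability:

```python
# Win probability implied by an Elo gap, under the standard
# logistic Elo formula (400-point scale, as in chess ratings).
def elo_win_prob(gap: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-gap / 400.0))

print(round(elo_win_prob(144), 3))  # prints 0.696
```

In other words, a 144-point gap means the stronger model wins a head-to-head comparison roughly 70% of the time.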
From this vantage point, large-scale labour displacement can feel inevitable. As Leopold Aschenbrenner says, “this doesn’t require believing in sci-fi; it just requires believing in straight lines on a graph.”
Nowhere is this more prevalent than with METR’s “Measuring AI Ability to Complete Long Tasks” benchmark. Otherwise careful thinkers are quick to infer that the straight lines on the log-graph mean we are on the cusp of fully automating software engineering. Less careful thinkers generalize to “any job that mostly involves typing on a computer.”
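It’s worth spelling out what the “straight line” amounts to. METR’s published estimate is a task time horizon that doubles roughly every seven months; a straight line on a log-scale plot is just exponential growth. Everything in this sketch beyond the seven-month figure is naive extrapolation for illustration:

```python
# A straight line on a log-scale plot is exponential growth in linear terms.
# Naive extrapolation of a time horizon that doubles every `doubling_months`.
def horizon_after(months: float, start_hours: float = 1.0,
                  doubling_months: float = 7.0) -> float:
    return start_hours * 2 ** (months / doubling_months)

# Seven doublings (49 months) turn a 1-hour horizon into 128 hours,
# which is why the extrapolations sound so dramatic.
print(horizon_after(49))  # prints 128.0
```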
But Nathan Witkin has pointed out that, if you read the fine print, METR themselves admit most of the tasks in their benchmark are not representative of actual software engineering work, much less white-collar work. They are instead discrete, linear, easily measurable, algorithmically scorable stand-ins for software engineering sub-tasks. OpenAI say much the same about their GDPval benchmark.
What Dewey helps us see is that there’s actually nothing wrong with this: “Selective emphasis, choice, is inevitable…this is not an evil.” We often can’t directly measure the things we care about, so we use more easily measurable stand-ins or proxies. The problem only arises when we mistake these stand-ins for the real thing. “Deception,” Dewey continues, “comes only when the presence and operation of this choice is concealed, disguised, denied.”
Most interpretations of the METR results are deceptive in this sense. By METR’s own lights, there is a small subset of “messy” tasks that better represents the real work done by software engineers. Witkin argues that, on these tasks, no model topped a 30% success rate, and that this should have been the headline result.
Instead, we get social media posts like, “CLAUDE OPUS 4.6 HAS 14.5 HOUR TIME HORIZON!” followed by breathless commentaries about the incoming wave of automation and the “great disemboweling of white-collar jobs.”
AI progress is indeed alarmingly fast, and speeding up. It seems rational for Substack-reader-and-writer types (present author included) to worry about how and when AI will affect their livelihood. But we won’t make much progress on this question by having a bunch of mostly non-software engineers debating the merits of a blog post about crude stand-ins for software engineering sub-tasks. Like most Very Online debates, this one would benefit from touching grass.
I was recently approached by a healthcare organization which, like most in Ontario, had an appallingly long waitlist for new patients. And like most healthcare organizations, they were optimistic that AI could deliver some much-needed efficiencies.
Indeed, the healthcare sector has been one of the most rapid adopters of AI. According to the AMA, 66% of physicians used AI in 2024, up from 38% in 2023. ChatGPT now fields 40 million health-related queries per day. Tens of billions were invested in AI-powered health tech in 2025.
AI for clinical documentation is consistently ranked as a top use case, and for good reason. It is the lowest-hanging fruit on the automation tree: a repetitive, low-risk, high-volume, language-in-language-out task, right in the wheelhouse of today’s LLMs. This made it a sensible place to start our collaboration.
The idea was that clinicians would jot down a rough “scratch note” between patient visits, and we could then use LLMs to transform that scratch note into a structured SOAP note (a standard format consisting of four sections: Subjective, Objective, Assessment, Plan). We interviewed 20 clinicians about their experience with this process.
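As a concrete sketch of the transformation step (the prompt wording and function name here are hypothetical; only the four SOAP section names come from the standard), the core of such a pipeline is little more than prompt construction around the scratch note:

```python
# Hypothetical sketch of the scratch-note -> SOAP-note step described above.
# The prompt text is illustrative; a real deployment would pass it to
# whatever model API the organization uses.
SOAP_SECTIONS = ["Subjective", "Objective", "Assessment", "Plan"]

def build_soap_prompt(scratch_note: str) -> str:
    sections = ", ".join(SOAP_SECTIONS)
    return (
        "Rewrite the following clinical scratch note as a structured "
        f"SOAP note with four sections ({sections}). "
        "Do not invent findings that are not in the note.\n\n"
        f"Scratch note:\n{scratch_note}"
    )

print(build_soap_prompt("8yo, OT session, pencil grasp improving").splitlines()[0])
```

The simplicity of this step is part of why clinical documentation looks like such low-hanging fruit; as the interviews below show, the complications live elsewhere.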
To understand what we found, and what it means for the broader debates about AI progress and automation, we have to go into the weeds (if you want to go deeper into the weeds, you can read the paper).
Write on paper, wrong in practice: 4 themes
Heterogeneity
The clinicians in our study were mostly occupational therapists working in pediatric rehabilitation. Some worked in schools, others in the clinic. The first theme that emerged in our interviews was that documentation workflows varied considerably. Some clinicians wrote scratch notes that could easily be fed to the LLMs; others used personal shorthand that couldn’t. School-based clinicians, working in a classroom, didn’t have the luxury of writing scratch notes at all. And some clinicians’ “scratch” notes were so detailed that there wasn’t much left for the LLM to do. Even within the narrow subset of clinical documentation for occupational therapists in pediatric rehabilitation, there wasn’t a single process to be automated.
Countability
Another theme that came up repeatedly in our interviews was that “SOAP notes aren’t the problem.” Here are two representative quotes:
“No, I think if I had to tell you what I think the problem is, I don’t think it’s our ability to write, It’s the amount of things we have to do.”
“I think I was intrigued by it at first and then the reality of it, I was like, it actually doesn’t save me time...I think our other processes are extremely inefficient.”
Even if LLMs were able to reduce their documentation burden, clinicians were skeptical that it would improve their overall situation. This echoes C. Thi Nguyen’s insight in The Score that, within organizations, “easy countability,” understood here as time spent writing notes, “automatically wins out over actual importance.”
Identity
A pervasive assumption in the academic literature is that clinical documentation is a burdensome task that should, to the extent possible, be automated away. And for many clinicians, this assumption holds. Their lives would go better if they could reduce the amount of time spent writing notes in their pajamas.
But a few clinicians in our study pointed out that note-writing can be an expression of professional identity:
“Like even in an objective way where I’m stating ‘objective’ [the ‘O’ in SOAP]...It still has your person like, it has your sound to it. And so when you’re reading and it doesn’t have your sound, like this is wild.”
Others noticed how they could guess where a clinician trained, based on their documentation style.
Many clinicians view their notes as a burden better offloaded. But some view them as an outlet for their clinical reasoning and judgment, in which case the LLM tools looked like a solution in search of a problem.
Tools
Our study and many others have found that, for some clinicians, time spent prompting, reviewing, editing, and iterating with LLMs erases productivity gains. As one of our participants said:
“I’m like this is too much effort for me to tell you what to do. I already know what I wanna do, so I would just use my own note like I would just abandon it altogether.”
Still, even if LLMs don’t save time, they can reduce burnout because many people find editing outputs to be less cognitively demanding than generating them.
In this context, perhaps our most interesting finding was that, among the clinicians who didn’t outright abandon the tools, some reported adapting their own workflows by writing more AI-friendly scratch notes. Rather than the tools helping clinicians, clinicians were helping the tools.
At this point, one might object that these are a bunch of cherry-picked examples from a small qualitative study of an unrepresentative population. What does the broader literature find?
The best evidence comes from an RCT conducted at UCLA: 238 physicians across 14 specialties, encompassing 72,000 patient encounters. Users of one tool (Nabla) saw a 10% reduction in documentation time, about 40 seconds per note. Another tool (DAX) showed smaller, non-significant reductions.
The largest longitudinal study to date is from Kaiser Permanente: over 7,200 physicians, 2.5 million patient encounters. The paper reports 15,700 hours saved collectively, with after-hours documentation reduced, on average, by a minute per appointment. 84% of physicians said AI improved their ability to connect with patients, and 82% reported greater job satisfaction.
A smaller longitudinal study (200+ clinicians over 180 days) found no speed-up on average. And a recent study out of UCSF showed that 86% of physicians perceived documentation time reductions when using AI tools, but these subjective impressions did not track any objective time savings. This last point mirrors another METR study, which found that software engineers subjectively felt AI sped them up by around 20% when it in fact slowed them down by around 20%.
So much for the weeds.
What do we make of these findings? Our qualitative study found limited benefits of LLMs for clinical documentation. The quantitative literature is mixed, but seems to find real but modest time savings and burnout reduction. Solid, but not transformative. Normal for this kind of technology, we might say.
What does this tell us about AI and automation more generally?
It’s difficult to imagine a real-world task better suited to today’s LLMs than this kind of clinical documentation. It really does seem like the lowest-hanging fruit, so it would be easy to infer that if LLMs struggle here, we should be skeptical of claims about broader automation and job loss. While I’m sympathetic to this inference, and to the way people like David Oks have argued for it, I think Derek Thompson is mostly right that nobody really knows.
But if we want to make epistemic progress, there’s no substitute for going into the weeds and studying how actual people are using actual AI tools to do actual tasks within their jobs. From this perspective, one is much less likely to commit the Philosopher’s Fallacy. There’s no mistaking the stand-in for the real thing: Writing clinical notes is just one small sub-task within the broader task of “indirect” clinical work. And this “indirect” work is itself ancillary to the much broader and infinitely more complex set of meta-tasks, tasks and sub-tasks associated with “direct” patient care.
None of this is unique to healthcare. In software engineering (arguably the job ripest for AI automation), Steve Newman (an actual software engineer) makes the same point: benchmarks are not job tasks, and even if they were, jobs are still more than bundles of tasks.
And yet, quantitatively and qualitatively, AI progress continues to accelerate. The question we’re left with is whether trillions of dollars of compute can summon the real thing from the stand-ins.
