Summits and Goalposts
In debates about AI progress, when is it justified to “move the goalposts”?
It’s fashionable to liken AI to a mirror which passively reflects what’s already present. But AI can also act like a contrast dye, actively showing what’s missing.
I saw this most clearly in my graduate seminar’s discussion of Valerie Tiberius’s just-published Artificially Yours: Real Friendship in a World of Chatbots.
Tiberius rejects both the Techno-Optimists, who think that AI companions will solve the loneliness epidemic, and the Techno-Pessimists, who are convinced that such companions are nothing more than Big Tech’s latest affront to human dignity. She argues that different kinds of friendships can be valuable in different ways, some of which AI companions can deliver, some of which they can’t.
Before AI companions, many people had meaningful online relationships carried out entirely through text: online support groups for illness and grief, MMO guilds, pen pals, and so on. One question Artificially Yours forces us to reckon with is this: if people can have online, text-only relationships with AIs, wouldn’t those count as friendships, too?
When I raised this question in class, a student who had developed these kinds of online friendships (with humans, on are.na) pointed out that the novel and pressing possibility of AI friends helped to clarify something they hadn’t realized was important about their online human friendships. Namely, there was an implicit assumption that, if one of them travelled near the other, they’d meet up. If they learned that an are.na friend had been at a meeting just a short drive away and hadn’t reached out, they’d rightly feel hurt. One cannot assume this with AIs because they are not the kinds of things that can meet up in person.
The closer AI gets to some human-like capability, the easier it is to see what’s missing.
Helen Toner’s recent essay argued that “AGI” was a more useful concept when we were farther away from it. The point of talking about AGI circa 2006 was “to contrast with the ‘narrow’ AI systems that existed at the time” doing single tasks like detecting credit card fraud, or filtering out spam e-mails.
Back then:
It was helpful to be able to gesture in the direction of much more capable, general-purpose systems that we might one day develop. But that’s changed. Today’s best AI systems are good enough that they’re now inside the fuzzy conceptual cloud of “AGI-ish”: that is, they’ve surpassed some people’s definitions of AGI, while falling well short of others’. As a result, talking about “AGI” is no longer a helpful way to gesture in a rough direction.
Proximity has a way of enforcing conceptual clarity.
Here’s another example:
This is half-correct. Something important is happening with false summits, when capabilities we once thought sufficient for “friendship”, “intelligence”, “reasoning”, “understanding”, or “creativity” no longer seem so, once achieved by LLMs. But the conceit that “we of course keep moving the goalposts” is too quick.
This leads to an important meta-question about AI progress: How can we tell the difference between legitimate false summits and illegitimate goalpost moving?
To maintain the belief that the Earth was the centre of the universe, Ptolemaic astronomers continually added epicycles to explain planetary motion. These ad hoc additions did explain planetary data retroactively, but they failed to predict novel phenomena.
Contrast this with the discovery of Neptune. Astronomers noticed that Uranus’s orbit deviated from Newtonian predictions. So, they proposed an unseen planet tugging on Uranus. They calculated exactly where this hypothetical planet should be, and then Johann Galle looked through his telescope and found Neptune there.
The difference here between Ptolemaics and Newtonians is captured by Imre Lakatos’s distinction between progressive and degenerating research programs. When faced with anomalies, a progressive program modifies hypotheses in ways that generate novel predictions and insights. A degenerating program modifies its hypotheses to absorb counterevidence without producing new understanding.
This distinction provides a useful heuristic for debates about AI progress.
ARC-AGI
François Chollet’s (2019) “On the Measure of Intelligence” provides a positive, theoretically motivated account of intelligence as skill-acquisition efficiency. The paper then operationalizes that account as a falsifiable instrument: the Abstraction and Reasoning Corpus (ARC) benchmark. State-of-the-art models at the time scored around 5%, while humans were near-ceiling.
In late 2024, OpenAI’s o3 model hit ~80% on ARC-AGI-1, surpassing the human baseline for the first time. ARC-AGI-2 was then launched in early 2025. It took frontier models about a year to approach human-level performance. In March of this year, the interactive ARC-AGI-3 benchmark was released. As of this writing, base models sit near-floor (GPT-5.5: 0.43%; Opus-4.7: 0.18%).
When ARC-AGI-4 is released, will Chollet and colleagues have moved the goalposts? I don’t think so. Theirs is like a progressive research program. Each release is a positive instrument designed for the next (probably false) summit. For ARC-AGI-2 it was symbolic interpretation, compositional reasoning, and rule application. For ARC-AGI-3 it’s exploration, modelling, goal-setting, and planning. Each release does the equivalent of pre-registering what it would look like if the planet were there. A degenerating program, by contrast, moves the goalposts, bolting on denials to protect the pre-ordained conclusion “AI isn’t intelligent.”
We didn’t know that “we’d meet up if we were nearby” was partially constitutive of online friendships until AI companions threw it into relief. Similarly, we used to think common-sense factoids captured something essential about “understanding” until LLMs saturated CommonSenseQA.
AI “near-misses” often function like contrast dyes, helping us to see that implicitly held criteria were doing important work. What should we do when the next human-like capability falls to LLMs and new criteria are made visible?
Writers worth reading will have a positive account. They’ll build new instruments, specify operationalizations, update theories, and stake new predictions. But many commentators will be quick with a negative account, concocting epicycle-equivalent rationalizations to insulate pre-established conclusions about what AI can’t do. This is the difference between legitimate false summits and illegitimate goalpost moving: whether the criticism is progressive or degenerating.
