Artificial intelligence (AI) reasoning models aren’t as smart as they’ve been made out to be. In fact, they don’t actually reason at all, researchers at Apple say.
Reasoning models, such as Anthropic’s Claude, OpenAI’s o3 and DeepSeek’s R1, are specialized large language models (LLMs) that dedicate more time and computing power to produce more accurate responses than their traditional predecessors.
The rise of these models has led to renewed claims from big tech firms that they could be on the verge of developing machines with artificial general intelligence (AGI) — systems that outperform humans at most tasks.
Yet a new study, published June 7 on Apple’s Machine Learning Research website, responds by landing a major blow against Apple’s competitors. Reasoning models don’t just fail to show generalized reasoning, the scientists say in the study; their accuracy collapses completely once tasks become sufficiently complex.
“Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities,” the researchers wrote in the study. “Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget.”
LLMs grow and learn by absorbing training data from vast quantities of human output. When given a prompt, a model draws on the statistical patterns stored in its neural network to predict, one token at a time, the most probable continuation of the text.
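To make that concrete, here is a minimal, purely illustrative sketch of next-token prediction; the toy vocabulary and probabilities are invented for this example, whereas a real LLM computes the distribution with a neural network conditioned on the whole prompt.

```python
import random

def sample_next_token(context: str) -> str:
    """Return one plausible continuation of `context` from a toy distribution."""
    # A real LLM would compute these probabilities with a neural network
    # conditioned on the full context; this hard-coded table is invented
    # purely for illustration.
    toy_distribution = {"Paris": 0.85, "Lyon": 0.10, "London": 0.05}
    tokens = list(toy_distribution)
    weights = list(toy_distribution.values())
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token("The capital of France is"))  # most often "Paris"
```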
Reasoning models are an attempt to further boost AI’s accuracy using a process known as “chain-of-thought.” It works by tracing patterns through this data using multi-step responses, mimicking how humans might deploy logic to arrive at a conclusion.
This gives the chatbots the ability to reevaluate their reasoning, enabling them to tackle more complex tasks with greater accuracy. During the chain-of-thought process, models spell out their logic in plain language for every step they take so that their actions can be easily observed.
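As a rough illustration of the idea, the sketch below contrasts a direct prompt with a chain-of-thought-style prompt. The call_model function is a hypothetical placeholder rather than any real API, and the exact prompting each reasoning model uses internally will differ.

```python
def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to whichever LLM API is in use."""
    raise NotImplementedError("replace with a real model call")

# A direct prompt asks only for the answer.
direct_prompt = "A train travels 120 km in 1.5 hours. What is its average speed?"

# A chain-of-thought prompt asks the model to spell out intermediate steps in
# plain language before committing to an answer, which is roughly the
# behaviour that reasoning models bake in by default.
cot_prompt = (
    "A train travels 120 km in 1.5 hours. What is its average speed?\n"
    "Think step by step: state the formula, substitute the values, "
    "then give the final answer on its own line."
)
```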
However, as this process is rooted in statistical guesswork instead of any real understanding, chatbots have a marked tendency to “hallucinate” — throwing out erroneous responses, lying when their data doesn’t have the answers, and dispensing bizarre and occasionally harmful advice to users.
An OpenAI technical report has highlighted that reasoning models are much more likely to be derailed by hallucinations than their generic counterparts, with the problem only getting worse as models advance.
When tasked with summarizing facts about people, the company’s o3 and o4-mini models produced erroneous information 33% and 48% of the time, respectively, compared to the 16% hallucination rate of its earlier o1 model. OpenAI representatives said they don’t know why this is happening, concluding that “more research is needed to understand the cause of these results.”
“We believe the lack of systematic analyses investigating these questions is due to limitations in current evaluation paradigms,” the authors wrote in Apple’s new study. “Existing evaluations predominantly focus on established mathematical and coding benchmarks, which, while valuable, often suffer from data contamination issues and do not allow for controlled experimental conditions across different settings and complexities. Moreover, these evaluations do not provide insights into the structure and quality of reasoning traces.”
Peeking inside the black box
To delve deeper into these issues, the authors of the new study gave four classic puzzles (river crossing, checker jumping, block-stacking and the Tower of Hanoi) to both generic and reasoning bots, including OpenAI’s o1 and o3 models, DeepSeek R1, Anthropic’s Claude 3.7 Sonnet and Google’s Gemini. They then adjusted each puzzle’s complexity from low to medium to high by adding more pieces to it.
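To see how quickly adding pieces ramps up the difficulty, consider the Tower of Hanoi: the optimal solution for n disks takes 2^n − 1 moves, so each extra disk roughly doubles the length of the flawless move sequence a model has to produce. The sketch below simply tabulates that growth; the specific disk counts are illustrative, not the study’s exact settings.

```python
# Minimum number of moves needed to solve the Tower of Hanoi with n disks.
# The count is 2**n - 1, so every added disk roughly doubles the length of
# the move sequence a solver must produce without a single error.
for n_disks in (3, 5, 7, 10, 15):
    print(f"{n_disks:2d} disks -> {2**n_disks - 1:6d} optimal moves")
```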
For the low-complexity tasks, the researchers found that generic models had the edge on their reasoning counterparts, solving problems without the additional computational costs introduced by reasoning chains. As tasks became more complex, the reasoning models gained an advantage, but this didn’t last when faced with highly complex puzzles, as the performance of both models “collapsed to zero.”
Upon passing a critical threshold, reasoning models reduced the number of tokens (the fundamental units of text that models break data down into) they assigned to more complex tasks, suggesting they were reasoning less and revealing fundamental limits on their ability to sustain chains of thought. And the models continued to hit these snags even when they were given the solutions.
“When we provided the solution algorithm for the Tower of Hanoi to the models, their performance on this puzzle did not improve,” the authors wrote in the study. “Moreover, investigating the first failure move of the models revealed surprising behaviours. For instance, they could perform up to 100 correct moves in the Tower of Hanoi but fail to provide more than 5 correct moves in the River Crossing puzzle.”
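For reference, the solution algorithm for the Tower of Hanoi is usually the standard recursive procedure, sketched below as an illustrative reconstruction (not the exact pseudocode the researchers supplied); according to the study, handing the models such a procedure still did not rescue their performance.

```python
def hanoi(n: int, source: str, target: str, spare: str) -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    # Move the top n-1 disks out of the way, shift the largest disk,
    # then move the n-1 disks back on top of it.
    return (
        hanoi(n - 1, source, spare, target)
        + [(source, target)]
        + hanoi(n - 1, spare, target, source)
    )

print(hanoi(3, "A", "C", "B"))  # 7 moves, the 2**3 - 1 optimum
```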
The findings suggest that the models rely far more on pattern recognition, and far less on emergent logic, than those heralding imminent machine intelligence claim. But the researchers do highlight key limitations of their study, including that the puzzles represent only a “narrow slice” of the reasoning tasks the models could be assigned.
Apple also has a horse in the AI race, though a lagging one: one analysis found Siri to be 25% less accurate than ChatGPT at answering queries, and the company has instead prioritized the development of efficient, on-device AI over large reasoning models.
This has inevitably led some to accuse Apple of sour grapes. “Apple’s brilliant new AI strategy is to prove it doesn’t exist,” Pedro Domingos, a professor emeritus of computer science and engineering at the University of Washington, wrote jokingly on X.
Nonetheless, some AI researchers have heralded the study as a much-needed dose of cold water on grandiose claims about current AI tools’ ability to one day become superintelligent.
“Apple did more for AI than anyone else: they proved through peer-reviewed publications that LLMs are just neural networks and, as such, have all the limitations of other neural networks trained in a supervised way, which I and a few other voices tried to convey, but the noise from a bunch of AGI-feelers and their sycophants was too loud,” Andriy Burkov, an AI expert and former machine learning team leader at research advisory firm Gartner, wrote on X. “Now, I hope, the scientists will return to do real science by studying LLMs as mathematicians study functions and not by talking to them as psychiatrists talk to sick people.”