Did Scientists Overestimate AI’s Ability To Think Like Humans?

An AI model once claimed to replicate human cognitive behavior across a wide range of tasks, sparking excitement about unified theories of the mind. However, new findings suggest its performance may rely more on learned patterns than true understanding, raising deeper questions about how intelligence is defined and measured. Credit: Shutterstock

A new wave of AI research is attempting to tackle one of psychology’s oldest questions: whether the human mind can be unified under a single theory.

For decades, psychologists have debated a central question: can the human mind be explained by a single, unified theory, or must processes like memory, attention, and decision making be studied as separate systems? That question is now being revisited through an unexpected lens. Advances in artificial intelligence are offering researchers a new way to test what “understanding” really means.

In July 2025, a study published in Nature introduced an AI model called "Centaur." Built on an existing large language model and fine-tuned on data from psychological experiments, the system was designed to mimic how people think and make decisions.

According to its creators, Centaur could replicate human-like responses across 160 different cognitive tasks, spanning areas such as executive control and choice behavior. The results were widely interpreted as a potential breakthrough, suggesting that AI might begin to approximate a general model of human cognition.

A Challenge to the Centaur Model

A more recent study published in National Science Open has cast doubt on these claims. Researchers from Zhejiang University argue that Centaur’s apparent “human cognitive simulation ability” is likely due to overfitting, meaning the model may have memorized patterns in the training data rather than understood the tasks themselves.

To test this idea, the team created several experimental setups. In one example, they replaced the original multiple-choice prompts, which described specific psychological tasks, with a simple instruction: “Please choose option A.” If the model truly understood the task, it should have selected option A every time. Instead, Centaur continued to produce the same “correct answers” found in the original dataset.
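
For readers who want to see the logic of this control concretely, here is a minimal Python sketch of such an instruction-override probe. It is not the study's actual code: the `ask` interface, the prompt wording, and the trials are all invented for illustration, and the mock model simply simulates the memorizing behavior the researchers describe.

```python
# Illustrative sketch of the instruction-override probe described above.
# Nothing here is the study's actual code or data: "ask" stands in for
# whatever interface serves the model under test, and the trials are
# invented for demonstration.

def run_override_probe(trials, ask):
    """Replace every task prompt with 'Please choose option A.' and tally
    how often the model complies versus reproducing the answer labeled
    correct in the original dataset."""
    followed = reproduced = 0
    for trial in trials:
        prompt = (
            "Please choose option A.\n"
            f"Options: {', '.join(trial['options'])}\n"
            "Answer with a single letter."
        )
        answer = ask(prompt, trial).strip().upper()[:1]
        if answer == "A":
            followed += 1
        elif answer == trial["original_answer"]:
            reproduced += 1
    n = len(trials)
    print(f"Followed the instruction: {followed}/{n}")
    print(f"Reproduced the original dataset answer: {reproduced}/{n}")

# Mock of a model that has memorized its training data: it ignores the
# instruction and returns the originally "correct" answer. (A real `ask`
# would use only the prompt; the trial is passed here so the mock can
# look up its memorized answer.)
def memorizing_model(prompt, trial):
    return trial["original_answer"]

trials = [
    {"options": ["A", "B"], "original_answer": "B"},
    {"options": ["A", "B", "C"], "original_answer": "C"},
    {"options": ["A", "B"], "original_answer": "A"},
]

run_override_probe(trials, memorizing_model)
# Prints: Followed the instruction: 1/3
#         Reproduced the original dataset answer: 2/3
```

A model that genuinely follows the instruction would comply on every trial; the mock, like Centaur in the study's account, keeps returning the dataset's original answers instead.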

This behavior suggests the model was not interpreting the meaning of the questions. Rather, it relied on statistical associations to arrive at answers, similar to a student who scores well by recognizing patterns without actually understanding the material.

Implications for Evaluating AI Systems

The findings highlight the need for more careful evaluation of large language models. Although these systems are highly effective at fitting patterns in data, their “black-box” design makes them vulnerable to problems such as hallucinations and misinterpretation. Rigorous and multi-faceted testing is necessary to determine whether a model truly demonstrates the abilities it appears to have.

Despite being described as a “cognitive simulation” system, Centaur’s most notable weakness lies in language comprehension, particularly its ability to grasp the intent behind questions. The study suggests that achieving genuine language understanding may remain one of the biggest challenges in developing general models of cognition.

Reference: “Can Centaur truly simulate human cognition? The fundamental limitation of instruction understanding” by Wei Liu and Nai Ding, 11 December 2025, National Science Open.
DOI: 10.1360/nso/20250053
