
ChatGPT can sound convincing, but this study shows it still struggles to tell what’s actually true.
Washington State University professor Mesut Cicek and his team repeatedly evaluated ChatGPT by giving it hypotheses drawn from scientific studies. The AI was asked to decide whether each statement was supported by research — essentially judging if it was true or false.
In total, the researchers tested more than 700 hypotheses and submitted each one 10 times to examine how consistent the responses would be.
Accuracy Results and Performance Limits
In the initial 2024 experiment, ChatGPT answered correctly 76.5% of the time. When the study was repeated in 2025, accuracy rose slightly to 80%. However, once the results were adjusted for random guessing, the performance looked far less reliable: the AI scored only about 60% better than chance, which the researchers likened to a low D grade rather than strong performance.
The system had particular difficulty identifying false statements, correctly labeling them only 16.4% of the time. It was also inconsistent: when given the exact same prompt 10 times, ChatGPT produced consistent results in only about 73% of cases.
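The consistency figure can be made concrete with a short sketch. The paper's exact metric isn't specified here, so this assumes the natural reading: a hypothesis counts as "consistent" when all 10 repeated responses agree. The example data below is hypothetical, not from the study.

```python
from collections import Counter

def is_consistent(answers):
    """True if every repeated response to the same prompt agrees."""
    return len(set(answers)) == 1

def majority_share(answers):
    """Fraction of responses matching the most common answer
    (1.0 for a unanimous run, 0.5 for a 5 true / 5 false split)."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

# Hypothetical repeat-runs for three hypotheses, 10 prompts each
runs = [
    ["true"] * 10,                     # unanimous
    ["true", "false"] * 5,             # the 5/5 split Cicek describes
    ["true"] * 8 + ["false"] * 2,      # mostly consistent
]

consistency_rate = sum(is_consistent(r) for r in runs) / len(runs)
print(consistency_rate)        # 1 of 3 runs is unanimous
print(majority_share(runs[1])) # 0.5 for the 5/5 split
```

Under this reading, the study's ~73% figure would mean roughly a quarter of hypotheses got contradictory verdicts across identical prompts.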
Inconsistent Answers to Identical Questions
“We’re not just talking about accuracy, we’re talking about inconsistency, because if you ask the same question again and again, you come up with different answers,” said Cicek, an associate professor in the Department of Marketing and International Business in WSU’s Carson College of Business and lead author of the new publication.
“We used 10 prompts with the same exact question. Everything was identical. It would answer true. Next, it says it’s false. It’s true, it’s false, false, true. There were several cases where there were five true, five false.”
AI Fluency Versus Real Understanding
The study, published in the Rutgers Business Review, highlights the importance of caution when using AI for important decisions, especially those involving nuance or complex reasoning. While generative AI can produce fluent and convincing language, it does not necessarily demonstrate true understanding.
Cicek said the findings suggest that artificial general intelligence capable of genuine reasoning may still be further away than some expect.
“Current AI tools don’t understand the world the way we do — they don’t have a ‘brain,’” Cicek said. “They just memorize, and they can give you some insight, but they don’t understand what they’re talking about.”
Study Design and Methods
Cicek worked alongside Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University.
The team analyzed 719 hypotheses from scientific papers published in business journals since 2021. Determining whether research supports a hypothesis is often complex, involving multiple factors that can influence the outcome. Reducing that complexity to a simple true-or-false decision requires careful reasoning.
The researchers tested the free version of ChatGPT-3.5 in 2024 and the updated ChatGPT-5 mini in 2025. Overall, results were similar across both versions. After adjusting for random chance, which gives a 50% likelihood of a correct answer, the AI’s performance was only about 60% better than chance in both years.
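The chance adjustment described above follows from simple arithmetic. Assuming the standard correction for a binary task (rescaling so that random guessing scores 0 and perfect accuracy scores 1), the reported raw accuracies map onto the "about 60% better than chance" figure:

```python
def chance_corrected(accuracy, chance=0.5):
    """Rescale raw accuracy so that random guessing (50% on a
    true/false task) maps to 0 and perfect accuracy maps to 1."""
    return (accuracy - chance) / (1 - chance)

print(round(chance_corrected(0.80), 2))   # 0.6  -> ~60% better than chance
print(round(chance_corrected(0.765), 2))  # 0.53 for the 2024 run
```

This is why a headline accuracy of 80% on a coin-flip task is far less impressive than it sounds: half of it is what guessing alone would deliver.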
Key Weakness in AI Reasoning
The findings reveal an important limitation of large language models: although they can generate polished and persuasive responses, they often struggle with deeper reasoning. This can lead to answers that sound convincing but are actually incorrect, Cicek said.
Why Experts Urge Caution
Based on these results, the researchers recommend that business leaders verify AI-generated outputs and approach them with skepticism. They also emphasize the importance of training users to understand both the strengths and limitations of AI tools.
While this study focused on ChatGPT, Cicek noted that similar tests with other AI systems have shown comparable outcomes. The research also builds on earlier work highlighting concerns about AI hype. A 2024 national survey found that consumers were less likely to purchase products when they were marketed with a focus on AI.
“Always be skeptical,” he said. “I’m not against AI. I’m using it. But you need to be very careful.”