Hallucinations debunked
An overview of the state of AI's most annoying problem
OpenAI recently released a paper called Why Language Models Hallucinate (link). It digs into why models are inherently incentivized by their training to prefer giving any answer, even a wrong one, rather than admitting they don’t know. Hallucinations are one of the most familiar issues for anyone using AI, and this paper pulls back the curtain on why they happen.
What’s wild is that in 2025, this still isn’t a solved problem. I wanted to take a step back and clarify what hallucinations really are, how people are dealing with them today, and what’s possible going forward. As an investor, I find it especially important to understand whether claims of being “hallucination-free” can ever actually be true.
What are hallucinations?
There was a funny scene in Tucker Carlson’s recent podcast interview with Sam Altman, where Carlson pressed him on whether ChatGPT lies to its users. The question was really why AI models lie, and Altman’s answer drew a distinction between lying and hallucinating. Lying implies intent, which Altman says AI lacks. Hallucinating describes the AI generating incorrect but plausible-sounding information. These mistakes happen because the AI is essentially predicting the most statistically likely response based on its training data, not because it has any intent to deceive.
If you’ve spent time with language models, you’ve seen this. A model might cite a research paper that doesn’t exist or invent a plausible-sounding law. These are classic hallucinations: outputs that sound right but collapse under scrutiny.
OpenAI’s paper defines hallucination as “an important special case of errors produced by language models” or, more specifically, overconfident, plausible falsehoods. The danger comes from that combination. If a model said “I don’t know,” the mistake would be obvious. But because it produces fluent language that looks certain, it’s easy to mistake a falsehood for the truth.
We can think about model errors more broadly, and hallucinations are just one subset. The main categories people talk about are:
Fabrication: inventing entirely fake objects (papers, APIs, laws).
Reasoning errors: faulty logic even when the facts are correct.
Miscalibration: being overly confident in wrong answers.
Capability errors: failing to solve a task at all, like a math proof.
Hallucinations sit at the intersection of these categories. Fabrications are the clearest case, since they involve making things up. Miscalibration amplifies the risk, because falsehoods are delivered with high confidence. Reasoning errors can look like hallucinations when the model produces a chain of logic that sounds plausible but is wrong.
Capability errors are interesting to me. They are not strictly hallucinations, but because models rarely refuse to answer, their failed attempts often look like them. The distinction is in the cause. A hallucination happens when the model could in principle do the task but produces a falsehood anyway, often because of calibration issues, gaps in data coverage, or sampling. A capability error happens when the model fundamentally cannot perform the task, either because it requires reasoning or compositional skills that the model lacks, or because the task needs external tools like a calculator, code interpreter, or retrieval engine.
It’s tempting to think capability errors are just data gaps, and that if we gave the model more examples, it would eventually learn. But that’s not always true. More data helps with recall of facts, but it does not solve reasoning or compositional limits. You can drastically improve reasoning with scale and data, but next-token prediction alone will not yield systematic, reliable reasoning on open-ended tasks. Without changes to the training objective or architecture, models will still fail unpredictably outside their training distribution. That’s why researchers are exploring new directions: tool-augmented models, chain-of-thought supervision, process-based rewards, and other methods that target competence rather than recall.
So while hallucinations are mainly about truthfulness, capability errors are about competence. They sometimes blur together in practice, but it’s worth keeping the distinction clear.
Why exactly do hallucinations happen?
To understand hallucinations, it helps to revisit how models are trained and how they generate answers. During training, language models learn to assign probabilities to the next token given the previous ones, across massive text corpora. At inference time, they have to turn those probabilities into an actual answer, and this is where decoding strategies come in.
The simplest method, greedy decoding, always picks the single most likely next token. Most real systems don’t do this, because it produces repetitive or overly narrow text. Instead they use sampling methods (temperature, nucleus sampling, best-of-n) or constrained decoding. These choices matter: higher temperature and permissive sampling increase the chance of hallucination, while constrained decoding and grammar-based approaches can reduce it, especially in structured tasks like code or SQL.
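To make those decoding choices concrete, here is a minimal sketch (plain NumPy over toy logits, not any particular model’s API) of greedy decoding versus temperature scaling and nucleus (top-p) sampling on a single next-token distribution:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=1.0, greedy=False, rng=None):
    """Pick a next-token id from a vector of logits.

    greedy=True  -> always take the argmax (greedy decoding)
    temperature  -> <1 sharpens the distribution, >1 flattens it
    top_p        -> nucleus sampling: keep the smallest set of tokens
                    whose cumulative probability reaches top_p
    """
    rng = rng or np.random.default_rng()
    if greedy:
        return int(np.argmax(logits))

    # Temperature scaling followed by softmax.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Nucleus filtering: zero out the low-probability tail, then renormalize.
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = probs[order[:cutoff]]
    mask /= mask.sum()

    return int(rng.choice(len(probs), p=mask))

# Toy distribution over a five-token vocabulary.
logits = [2.0, 1.5, 0.3, -1.0, -2.0]
print(sample_next_token(logits, greedy=True))                  # always token 0
print(sample_next_token(logits, temperature=1.5))              # more diverse, more error-prone
print(sample_next_token(logits, temperature=0.7, top_p=0.9))   # sharper, tail clipped
```

The same model weights can look noticeably more or less prone to hallucination depending on nothing but these inference-time knobs.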
Why does the model sometimes pick the wrong path?
Knowledge gaps and uncertainty. Models don’t know everything, and not all questions have a single clear answer. Two kinds of uncertainty matter:
Epistemic uncertainty comes from the model not having enough knowledge. For example, if a minor politician’s birthday appeared only once in the training data, the model can’t reliably recall it and will often guess.
Aleatoric uncertainty comes from the world itself being ambiguous. For example, if you ask “Who is the best jazz pianist?” or “What’s the current population of New York?”, there isn’t a single correct answer. Even humans would disagree or give a range.
On top of that, the world changes. Facts that shift after the training cutoff (e.g., “Who is the current UK Prime Minister?”) won’t be captured directly in the model weights. Unless you pair the model with retrieval or another live data source, it will hallucinate or confidently give outdated information.
Objective mismatch. During training, models learn by minimizing a loss function that rewards them for predicting the next word correctly. The standard version of this, cross-entropy loss, rewards the model for matching the ground-truth continuation. If the training data contains “I don’t know” as the true response, the model can learn to produce it. But because training data overwhelmingly contains answers, not abstentions, the model is implicitly encouraged to always produce something rather than refuse.
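A toy illustration of that objective, with made-up numbers and whole answers treated as single tokens for simplicity: cross-entropy only rewards whatever continuation actually appears in the data, so an abstention the corpus never contains is never reinforced.

```python
import numpy as np

def next_token_cross_entropy(probs, target_id):
    """Loss for a single prediction step: -log p(target)."""
    return -np.log(probs[target_id])

# Hypothetical distribution the model puts over three continuations of
# "The capital of Freedonia is ...":
#   0 -> "Fredville" (a guess), 1 -> "Marxtown" (a guess), 2 -> "I don't know"
probs = np.array([0.55, 0.40, 0.05])

# If the corpus happens to contain "Fredville", the confident guess is rewarded:
print(next_token_cross_entropy(probs, target_id=0))  # ~0.60, low loss

# The model is only rewarded for "I don't know" when the corpus literally
# contains it as the continuation, which is rare:
print(next_token_cross_entropy(probs, target_id=2))  # ~3.00, high loss
```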
Researchers have explored ways to fix this, such as training models to abstain when confidence is low, or calibrating their probabilities so confidence better reflects accuracy. These approaches reduce hallucinations, but they come with trade-offs: if the model abstains too often, it becomes less useful.
Post-training incentives. After pretraining, methods like RLHF and DPO adjust models to be more helpful and fluent. But the way benchmarks and preference datasets are structured often penalizes hedging or uncertainty. A model that says “I’m not sure” can be scored worse than one that confidently gives a wrong answer. That creates an incentive for overconfidence.
So hallucinations are really a structural byproduct of how every large language model is built. How, then, are people mitigating them in practice?
Current methods to “solve” hallucinations
Most attempts to “solve” hallucinations are really about mitigation. The core training objective hasn’t changed, so what we see are layered strategies that reduce, detect, or contain hallucinations rather than eliminating them outright.
Retrieval-augmented generation (RAG). The idea is simple: give the model fresher or more relevant context so it has less room to make things up. RAG does help, but the effect depends entirely on retrieval quality. If recall is low, if the corpus is stale, or if the retrieved passages are irrelevant, the model will still hallucinate. RAG improves faithfulness (sticking to the retrieved context) but it does not guarantee factuality (being correct relative to the real world) when the underlying database itself is incomplete or wrong.
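As a rough sketch of the pattern (where embed and llm are hypothetical stand-ins for whatever embedding model and LLM client you actually use), RAG is just retrieval plus a prompt that tells the model to stay inside the retrieved context:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, corpus, embed, k=3):
    """Rank corpus passages by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda passage: cosine(q, embed(passage)), reverse=True)
    return ranked[:k]

def answer_with_rag(query, corpus, embed, llm):
    """Ground the model in retrieved passages and instruct it to stay faithful to them."""
    context = "\n".join(retrieve(query, corpus, embed))
    prompt = (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)
```

Everything rides on what retrieve returns: a stale or irrelevant corpus still yields a grounded-sounding but wrong answer.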
Verification and cross-checking. Some systems run a second model as a judge, or compare multiple generations for consistency. You can also use symbolic validators or tool calls to check hard constraints, like whether code runs or an equation balances. These methods can catch errors, but they are brittle. Recent studies show that many “LLM-as-a-judge” systems fail under closer evaluation, sometimes performing no better than crude heuristics like checking response length. Verification helps, but it is not yet robust across domains.
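Two of the simplest versions, sketched below with a hypothetical generate callable standing in for a sampling call: compare several generations for agreement, and apply a hard symbolic check where one exists (here, whether generated Python at least compiles):

```python
from collections import Counter

def self_consistency(generate, prompt, n=5):
    """Sample the same prompt several times and keep the majority answer.
    Low agreement across samples is a useful hallucination warning sign."""
    answers = [generate(prompt) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n  # (answer, agreement rate)

def compiles_as_python(code):
    """Hard constraint check: does the generated snippet at least parse?"""
    try:
        compile(code, "<generated>", "exec")
        return True
    except SyntaxError:
        return False
```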
Uncertainty estimation and selective answering. A more principled approach is to let models say “I don’t know.” Log probabilities provide one proxy, but raw logprobs are poorly calibrated. Research on calibration metrics (like expected calibration error or Brier scores) shows that models are often overconfident in wrong answers and underconfident in correct ones. Some groups train explicit uncertainty estimators or use selective prediction, where the model abstains if confidence is below a threshold. This reduces hallucinations at the cost of coverage, which can be a worthwhile trade in high-stakes applications.
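A minimal sketch of the idea, assuming you already have a per-answer confidence score (ideally from a calibrated estimator rather than raw logprobs):

```python
import numpy as np

def brier_score(confidences, corrects):
    """Mean squared gap between stated confidence and actual correctness (0/1).
    Lower is better; confident wrong answers are punished hardest."""
    confidences = np.asarray(confidences, dtype=float)
    corrects = np.asarray(corrects, dtype=float)
    return float(np.mean((confidences - corrects) ** 2))

def selective_answer(answer, confidence, threshold=0.8):
    """Abstain whenever confidence falls below the threshold."""
    return answer if confidence >= threshold else "I don't know."

# Raising the threshold trades coverage for reliability:
print(selective_answer("Paris", confidence=0.95))  # answered
print(selective_answer("1874", confidence=0.40))   # abstains
```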
Hidden-state probes. A newer line of work uses probes on internal activations to flag likely hallucinations in real time. This is promising because it doesn’t rely on surface heuristics, but it remains fragile. Probes often fail to generalize across prompts or domains, so they are not yet a drop-in solution. Think of them as an early alarm system, not a cure.
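In its simplest form a probe is just a linear classifier trained on activation vectors labeled hallucinated or not. The sketch below uses placeholder random data purely to show the shape of the approach, not a working detector:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 768))   # placeholder for real model activations
labels = rng.integers(0, 2, size=1000)         # placeholder labels: 1 = hallucinated

# Linear probe: logistic regression over the hidden state of the answer.
probe = LogisticRegression(max_iter=1000).fit(hidden_states, labels)

def hallucination_risk(activation):
    """Probability the probe assigns to 'this answer is hallucinated'."""
    return float(probe.predict_proba(activation.reshape(1, -1))[0, 1])
```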
Fine-tuning. I know I listed post-training as one of the main reasons hallucinations persist, but it can also be part of the solution. When alignment data and benchmarks reward fluency and helpfulness, models learn to be confident guessers, which increases hallucinations. But when fine-tuning explicitly rewards calibrated abstention, grounded answers, or citing sources, it can push models toward more reliable behavior. Instruction tuning, RLHF, and DPO are just methods; what matters is the signal you give them. With carefully designed data, fine-tuning can reduce the frequency or severity of hallucinations.
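One way to supply that signal, sketched with illustrative field names rather than any specific library’s schema, is to build preference pairs in which a calibrated refusal or grounded answer is preferred over a fluent but false one:

```python
def make_abstention_pair(question, confident_but_wrong, grounded_or_abstaining):
    """A DPO-style preference record that rewards calibrated behavior."""
    return {
        "prompt": question,
        "chosen": grounded_or_abstaining,   # a cited answer or an honest refusal
        "rejected": confident_but_wrong,    # fluent, confident, and false
    }

pair = make_abstention_pair(
    "What year was the Treaty of Alderton signed?",   # fictional treaty
    "The Treaty of Alderton was signed in 1874.",
    "I can't find a reliable source for a 'Treaty of Alderton', so I won't guess.",
)
```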
The bottom line is that no single method guarantees reliability. Strong systems combine retrieval, grounding, verification, abstention, and conservative decoding. The mindset has shifted from chasing “hallucination-free” models to engineering hallucination-resilient stacks.
Why evaluation matters
Evaluation is currently a hot topic in the field. Old benchmarks that asked only “was the answer correct?” are not enough. Modern evaluations need to distinguish between faithfulness and factuality, and the two are measured differently. Faithfulness can be checked with span-level overlap against retrieved documents. Factuality often requires external verification, human annotation, or curated gold standards.
Selective prediction curves are becoming standard. They show how accuracy improves as the system abstains more often, which makes the trade-off between coverage and reliability explicit. This is critical for real deployments, where a system that answers 90% of the time with 95% accuracy may be better than one that answers 100% of the time with 85% accuracy.
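Computing such a curve from a labeled evaluation set is straightforward; the sketch below assumes you have per-question confidence scores and correctness labels:

```python
import numpy as np

def selective_prediction_curve(confidences, corrects, thresholds):
    """For each threshold, report coverage (fraction answered) and
    accuracy on the answered subset."""
    confidences = np.asarray(confidences, dtype=float)
    corrects = np.asarray(corrects, dtype=bool)
    points = []
    for t in thresholds:
        answered = confidences >= t
        coverage = float(answered.mean())
        accuracy = float(corrects[answered].mean()) if answered.any() else float("nan")
        points.append((t, coverage, accuracy))
    return points
```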
Operational metrics also matter: percentage of responses with grounded citations, agreement between independent verifiers, escalation rates to human review, latency and cost overhead of verification layers. These tell you whether a system is production-ready.
Without rigorous evaluation, claims about hallucination-free models are meaningless. The question is not “does it hallucinate” but “how often, in what way, and under what operating conditions.”
Wrapping up
Whether we can ever completely eliminate hallucinations from these models is still an open question. Perhaps with current training objectives and model architectures, we will always need extensive scaffolding to mitigate them. However, some new training paradigms offer something to get excited about.
For example, FLAME introduces factuality-aware alignment to reshape both supervised fine-tuning and preference (DPO/RL) objectives toward truthfulness rather than fluency alone.
Another is Mask-DPO, which learns factual alignment at a finer granularity by masking nonfactual content during preference learning.
Then there’s UAlign, which embeds uncertainty estimation into alignment, helping models recognize when their knowledge is borderline and refuse accordingly.
All this to say: it’s fair to be skeptical when someone claims they’ve built a model with absolutely no hallucinations. You’d want to see the full stack: retrieval, verification, calibration, evaluation, and definitely an awareness of how these new alignment and steering methods can be applied.
