Why Language Models Hallucinate
Large language models (LLMs) sometimes generate statements that are plausible but factually incorrect—a phenomenon commonly called "hallucination." We argue that these errors are not mysterious failures of architecture or reasoning, but rather predictable consequences of standard training and evaluation incentives.
We show (i) that hallucinations can be viewed as classification errors: when a pretrained model cannot reliably distinguish a false statement from a true one, it may produce the false option rather than saying "I don't know"; (ii) that optimizing for benchmark performance encourages guessing rather than abstaining, since most evaluation metrics award no credit for expressing uncertainty; and (iii) that a practical mitigation is to revise existing benchmarks to reward calibrated abstention, thereby realigning the incentives that shape model development.
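To make the incentive argument in (ii) and (iii) concrete, here is a minimal sketch (not taken from the paper) comparing expected scores under two hypothetical grading schemes. The scoring values and the confidence parameter p are illustrative assumptions: under accuracy-style grading, where a wrong answer and an abstention both score zero, guessing weakly dominates abstaining at any confidence level; once wrong answers carry a penalty, abstaining becomes the better policy whenever confidence drops below a threshold.

```python
# Illustrative sketch (hypothetical scoring parameters, not the paper's method):
# expected score of guessing vs. abstaining when the model's best candidate
# answer is correct with probability p.

def expected_score(p, right=1.0, wrong=0.0, abstain=0.0):
    """Return expected scores for guessing and for abstaining."""
    guess = p * right + (1 - p) * wrong
    return {"guess": round(guess, 2), "abstain": abstain}

for p in (0.2, 0.5, 0.8):
    accuracy_style = expected_score(p)                 # wrong = 0, abstain = 0
    penalized = expected_score(p, wrong=-1.0)          # wrong answers cost -1
    print(f"p={p:.1f}  accuracy-style: {accuracy_style}  penalized: {penalized}")
```

With the penalized scheme above, guessing pays off in expectation only when p exceeds 0.5; under the accuracy-style scheme it never hurts to guess, which is the incentive the abstract argues pushes models toward confident fabrication.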
Joint work with Santosh Vempala (Georgia Tech) and Ofir Nachum & Edwin Zhang (OpenAI).