Real-Time Detection of Hallucinated Entities in Long-Form Generation

Oscar Obeso*¹, Andy Arditi*, Javier Ferrando, Joshua Freeman¹, Cameron Holmes², Neel Nanda

¹ ETH Zürich     ² MATS     * Co-first authors

Large language models are now routinely used in high-stakes applications where hallucinations can cause serious harm, such as medical consultations or legal advice. Existing hallucination detection methods, however, are impractical for real-world use, as they are either limited to short factual queries or require costly external verification.

We present a cheap, scalable method for real-time identification of hallucinated tokens in long-form generations, and scale it effectively to 70B parameter models. Our approach targets entity-level hallucinations—e.g., fabricated names, dates, citations—rather than claim-level, thereby naturally mapping to token-level labels and enabling streaming detection. We develop an annotation methodology that leverages web search to annotate model responses with grounded labels indicating which tokens correspond to fabricated entities. This dataset enables us to train effective hallucination classifiers with simple and efficient methods such as linear probes.

Evaluating across four model families, our classifiers consistently outperform baselines on long-form responses, including more expensive methods such as semantic entropy (e.g., AUC 0.90 vs 0.71 for Llama-3.3-70B), and also improve over them in short-form question-answering settings. Moreover, despite being trained only with entity-level labels, our probes effectively detect incorrect answers in mathematical reasoning tasks, indicating generalization beyond entities. While our annotation methodology is expensive, we find that annotated responses from one model can be used to train effective classifiers on other models; accordingly, we publicly release our datasets to facilitate reuse. Overall, our work suggests a promising new approach for scalable, real-world hallucination detection.

[Interactive demo] Real-time hallucination detection demo. Our probes detect fabricated entities as they're generated, with higher scores indicating greater hallucination risk; probe scores are binned into low (0.2-0.4), medium (0.4-0.7), and high (0.7+) risk.

Introduction

Large language models (LLMs) are now routinely used in high-stakes applications where hallucinations can cause serious harm, such as medical consultations or legal advice. Several recent works have addressed hallucination detection in short-form question-answering (QA) settings, where completions are typically brief (1-2 sentences), contain a single atomic claim, and can be unambiguously labeled as correct or incorrect. However, real-world LLM usage increasingly involves open-ended long-form generation, where models produce complex, multi-paragraph responses containing numerous interconnected claims.

Existing approaches for long-form hallucination detection, such as SAFE and FActScore, rely on external verification. These methods first extract claims from the generated text, then retrieve supporting evidence from external sources, and finally evaluate each claim in light of that evidence. While effective for comprehensive verification, these pipelines incur significant computational costs and latency, making them unsuitable for real-time detection during generation.

We introduce a token-level hallucination detection approach that enables cheap, real-time streaming monitoring during generation. We focus on entity-level hallucinations (e.g., fabricated names, dates, citations) rather than claim-level, as entities have clear token boundaries and can be verified as they appear. We train lightweight linear probes to predict from hidden activations whether a token is part of a hallucinated entity. We find that these probes consistently outperform other uncertainty-based baselines, including more expensive methods such as semantic entropy.
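
As a rough illustration of this setup, the sketch below scores each token of a finished response with a linear probe applied to an intermediate hidden state. The model name, layer index, and probe weights are placeholders, not the configuration used in this work.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
LAYER = 20                                       # intermediate layer to probe (assumption)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

d_model = model.config.hidden_size
w = torch.randn(d_model, dtype=torch.bfloat16)   # stand-in for trained probe weights
b = torch.zeros(1, dtype=torch.bfloat16)

@torch.no_grad()
def hallucination_scores(text: str) -> list[tuple[str, float]]:
    """Return (token, score) pairs; higher score = higher hallucination risk."""
    enc = tok(text, return_tensors="pt")
    out = model(**enc, output_hidden_states=True)
    h = out.hidden_states[LAYER][0]              # [seq_len, d_model]
    probs = torch.sigmoid(h @ w + b)             # per-token probabilities
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return list(zip(tokens, probs.float().tolist()))

Because the probe is a single linear map over activations the model already computes, scoring adds negligible overhead on top of generation itself.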

Methodology

Dataset construction for token-level hallucination detection

Automated annotation pipeline. We use a diverse set of prompts to elicit entity-rich long-form generations from target models. For each generation, we use a frontier LLM with web search to parse and verify all entities, yielding token-level hallucination labels.

To train token-level hallucination detectors, we need a dataset with precise annotations of which tokens correspond to hallucinated content.

Building on the LongFact dataset, we constructed LongFact++, a collection of thousands of prompts designed to elicit detailed responses rich in specific entities such as names, dates, and locations. Using these prompts, we generate entity-rich long-form completions from target models, which naturally contain a mix of hallucinated and non-hallucinated content.

We then annotate these completions using a frontier LLM (e.g., Claude) with web search access. For each completion, the frontier LLM extracts each entity in the text, searches for corroborating evidence online, and determines whether the entity is factually grounded. Since entities map directly to specific token sequences, these entity-level judgments provide us with the token-level training labels we need.
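
The sketch below shows one way such entity-level verdicts can be converted into token-level labels via character offsets. The verdict format and the example sentence are hypothetical stand-ins for the annotation output, not our actual schema.

from transformers import AutoTokenizer

# Requires a fast tokenizer so that character offsets are available.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def token_labels(text: str, verdicts: list[dict]) -> list[int]:
    """Label a token 1 if it overlaps a fabricated entity's character span, else 0."""
    enc = tok(text, return_offsets_mapping=True, add_special_tokens=False)
    labels = [0] * len(enc["input_ids"])
    for v in verdicts:
        if v["judgment"] != "fabricated":
            continue
        for i, (start, end) in enumerate(enc["offset_mapping"]):
            if start < v["end"] and end > v["start"]:  # character-span overlap
                labels[i] = 1
    return labels

# Toy example with an invented entity and a hypothetical verdict:
text = "The treaty was signed by Dr. Alaric Voss in 1987."
verdicts = [{"entity": "Dr. Alaric Voss", "start": 25, "end": 40, "judgment": "fabricated"}]
print(token_labels(text, verdicts))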

Training token-level probes

Given our token-level training labels, we train probes to predict whether each token is part of a hallucinated entity. We use linear probes that read hidden states from an intermediate layer of the model and output hallucination probabilities for each token.
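
A minimal training sketch is shown below, assuming hidden states from the chosen layer have already been extracted and cached; the hyperparameters are illustrative, not the values used in our experiments.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def train_linear_probe(X: torch.Tensor, y: torch.Tensor,
                       epochs: int = 5, lr: float = 1e-3,
                       batch_size: int = 4096) -> nn.Linear:
    """X: [num_tokens, d_model] hidden states; y: [num_tokens] binary labels."""
    probe = nn.Linear(X.shape[1], 1)
    opt = torch.optim.AdamW(probe.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()  # binary token-level objective
    loader = DataLoader(TensorDataset(X.float(), y.float()),
                        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(probe(xb).squeeze(-1), yb)
            loss.backward()
            opt.step()
    return probe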

For enhanced performance, we also experiment with LoRA adapters trained jointly with the probe. When using LoRA, we apply regularization to minimize changes to the model's original behavior while maximizing probe detection accuracy.
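
The sketch below shows one possible joint training step for this variant, using the peft library for the LoRA adapters and a KL-divergence term toward the base model's next-token distribution as the behavior-preserving regularizer; the specific regularizer, layer, and loss weighting are illustrative assumptions rather than our exact recipe.

import torch
import torch.nn.functional as F

LAYER, KL_WEIGHT = 20, 1.0  # placeholders

def joint_step(model, probe, batch, labels, optimizer):
    """model: a peft-wrapped LM with LoRA adapters; labels: [B, T] token-level labels."""
    out = model(**batch, output_hidden_states=True)
    h = out.hidden_states[LAYER]                  # [B, T, d_model]
    probe_logits = probe(h.float()).squeeze(-1)   # [B, T]
    probe_loss = F.binary_cross_entropy_with_logits(probe_logits, labels.float())

    # Regularize toward the base model's next-token distribution (padding masking omitted).
    with torch.no_grad(), model.disable_adapter():
        base_logits = model(**batch).logits
    kl = F.kl_div(F.log_softmax(out.logits, dim=-1),
                  F.log_softmax(base_logits, dim=-1),
                  log_target=True, reduction="batchmean")

    loss = probe_loss + KL_WEIGHT * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()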

Results

Token-level probes detect hallucinated entities. In long-form generation settings (LongFact, HealthBench), linear probes far outperform uncertainty-based baselines, with LoRA probes improving performance even further. Our probes also perform well in short-form QA settings (TriviaQA), and out-of-distribution reasoning domains (MATH). Results for Llama-3.3-70B are displayed.

In long-form settings (LongFact and HealthBench), token-level probes markedly outperform baselines. Simple linear probes consistently achieve AUCs above 0.85, and LoRA probes improve even further, pushing AUCs above 0.89. In comparison, the uncertainty-based baselines all struggle, failing to exceed 0.76 AUC.

In the short-form QA setting (TriviaQA), the baselines are stronger than in the long-form setting, yet probes still lead. Our LoRA probes consistently achieve greater than 0.96 AUC, and linear probes also perform well.

Notably, our probes also achieve strong results on the MATH dataset. This out-of-distribution performance suggests our method captures signals of correctness that generalize beyond its original target of fabricated entities.

Additional experiments

Beyond our main results, we investigate several important questions about the generalization and practical applicability of our approach.

Long-form training data is necessary for long-form performance. We examine whether probes trained on short-form vs. long-form data generalize to the other setting. While probes trained on long-form data transfer well to short-form evaluation (<0.05 AUC gap), probes trained only on short-form data show substantially worse performance on long-form tasks (~0.10 AUC gap).

Probes generalize across different model families. Probes trained on one model can detect hallucinations in other models' outputs with only small performance drops (0.02-0.04 AUC) compared to same-model performance.

Real-time monitoring enables selective answering. Our probes enable real-time intervention during generation. As a proof of concept, we demonstrate selective answering: when probe scores exceed a threshold during generation, the system abstains rather than potentially hallucinating. This achieves higher conditional accuracy on attempted questions, though at the cost of attempting fewer total questions.
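
A simplified sketch of this abstention logic is shown below, using greedy decoding without a KV cache for clarity; the threshold, abstention message, and layer index are hypothetical.

import torch

THRESHOLD = 0.7  # hypothetical abstention threshold
ABSTAIN = "I'm not confident enough in this answer to provide it."

@torch.no_grad()
def generate_with_abstention(model, tok, probe, prompt, layer=20, max_new_tokens=256):
    ids = tok(prompt, return_tensors="pt")["input_ids"]
    for _ in range(max_new_tokens):
        out = model(ids, output_hidden_states=True)
        next_id = out.logits[0, -1].argmax()
        # Probe the hidden state at the most recent position.
        score = torch.sigmoid(probe(out.hidden_states[layer][0, -1].float())).item()
        if score > THRESHOLD:
            return ABSTAIN  # abstain rather than risk hallucinating
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)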

Conclusion

This work represents an initial step toward practical, real-time hallucination detection in long-form generation. By framing hallucination detection as a token-level sequence labeling problem, we enable monitoring during generation without the computational overhead of external verification pipelines.

While significant challenges remain before practical deployment, our streaming detection approach demonstrates that token-level hallucination monitoring is feasible, and provides a promising foundation for advancing real-time factual reliability in language models.

Citation information

@misc{obeso2025realtimedetectionhallucinatedentities,
    title={Real-Time Detection of Hallucinated Entities in Long-Form Generation}, 
    author={Oscar Obeso and Andy Arditi and Javier Ferrando and Joshua Freeman and Cameron Holmes and Neel Nanda},
    year={2025},
    eprint={2509.03531},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2509.03531}, 
}