arxiv:2604.12373

Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

Published on Apr 14 · Submitted by Tomer Ashuach on Apr 15
Authors: Tomer Ashuach, Liat Ein-Dor, Shai Gretz, Yoav Katz, Yonatan Belinkov

Abstract

Large language models show no overall advantage over external models in predicting their own answer correctness, but they do show a domain-specific advantage on factual knowledge tasks when models disagree in their predictions.

AI-generated summary

Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness: information unavailable through external observation. We train correctness classifiers on question representations taken from both a model's own hidden states and from external models' hidden states, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement on answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.
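
As a rough illustration of this setup, here is a hedged sketch assuming pre-extracted question representations and a simple logistic-regression probe; the helper name probe_accuracy and the toy data are placeholders for illustration, not the authors' implementation.

```python
# A minimal sketch of the probing setup described above, assuming question
# representations (hidden-state vectors) have already been extracted from the
# self model and from a peer model. The helper name, the logistic-regression
# probe, and the toy data are illustrative choices, not the paper's code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def probe_accuracy(representations, correct_labels, seed=0):
    """Fit a linear correctness probe and report held-out accuracy.

    representations: (n_questions, hidden_dim) array of question representations
                     taken from one model (self or peer).
    correct_labels:  (n_questions,) 0/1 array indicating whether the answering
                     model got each question right.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        np.asarray(representations), np.asarray(correct_labels),
        test_size=0.2, random_state=seed,
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)


# Toy usage with random placeholder data; real inputs would be hidden states
# read off the question tokens of the self and peer models.
rng = np.random.default_rng(0)
n_questions, hidden_dim = 500, 64
labels = rng.integers(0, 2, size=n_questions)
self_acc = probe_accuracy(rng.normal(size=(n_questions, hidden_dim)), labels)
peer_acc = probe_accuracy(rng.normal(size=(n_questions, hidden_dim)), labels)
print(f"self-probe acc: {self_acc:.3f}  peer-probe acc: {peer_acc:.3f}")
```

Comparing the two held-out accuracies (self-probe vs. peer-probe) is the quantity of interest; on standard evaluation the paper finds them comparable.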

Community


TL;DR: Do LLMs have internal signals about whether they'll answer correctly — signals that external models can't access? We find the answer depends on the domain: yes for factual knowledge, no for mathematical reasoning.

[Figure: privileged_knowledge]

Key Findings:

  • Inter-model agreement masks privileged knowledge. Self-probes appear no better than cross-model probes on standard evaluations — but this is because models agree on ~80% of questions, letting external probes piggyback on shared difficulty patterns.
  • Disagreement subsets reveal genuine self-knowledge. When we restrict evaluation to questions where models disagree, self-representations consistently outperform peer representations on factual tasks (~5% premium gap, statistically significant across all configurations); a sketch of this computation follows the list.
  • Math correctness remains publicly observable. In mathematical reasoning, no premium gap emerges at any layer depth — suggesting correctness is governed by problem structure rather than model-specific retrieval.
  • The signal builds through depth. Layer-wise analysis shows the factual advantage emerges progressively from early-to-mid layers onward, consistent with idiosyncratic memory retrieval accumulating through the forward pass.
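
To make the disagreement-subset evaluation and the premium gap concrete, here is a hedged sketch under one plausible reading of the setup: the disagreement subset is taken to be the questions where the two models differ in correctness, and the gap is the self-probe's accuracy minus the peer-probe's accuracy on that subset. Function names, argument layout, and this exact definition of disagreement are assumptions for illustration, not the paper's protocol.

```python
# Hedged sketch of the disagreement-subset evaluation and "premium gap"
# described in the bullets above; an illustration, not the paper's code.
import numpy as np


def premium_gap(self_correct, peer_correct, self_probe_preds, peer_probe_preds):
    """Self-probe accuracy minus peer-probe accuracy on the disagreement subset.

    self_correct / peer_correct: 0/1 arrays, whether each model answered the
        question correctly; the disagreement subset is where these differ.
    self_probe_preds / peer_probe_preds: probe predictions of the *self*
        model's correctness, made from its own vs. the peer's representations.
    """
    self_correct, peer_correct, self_probe_preds, peer_probe_preds = map(
        np.asarray, (self_correct, peer_correct, self_probe_preds, peer_probe_preds)
    )
    disagree = self_correct != peer_correct
    if not disagree.any():
        return 0.0
    self_acc = (self_probe_preds[disagree] == self_correct[disagree]).mean()
    peer_acc = (peer_probe_preds[disagree] == self_correct[disagree]).mean()
    return float(self_acc - peer_acc)


def layerwise_gaps(self_preds_by_layer, peer_preds_by_layer, self_correct, peer_correct):
    """Repeat the gap per layer to localize where a self-advantage emerges."""
    return {
        layer: premium_gap(self_correct, peer_correct,
                           self_preds_by_layer[layer], peer_preds_by_layer[layer])
        for layer in self_preds_by_layer
    }


# Toy usage with placeholder arrays (real inputs would come from probe outputs).
self_correct = np.array([1, 0, 1, 1, 0, 0])
peer_correct = np.array([1, 1, 0, 1, 0, 1])
self_probe   = np.array([1, 0, 1, 1, 0, 0])
peer_probe   = np.array([1, 1, 0, 1, 0, 0])
print(premium_gap(self_correct, peer_correct, self_probe, peer_probe))
```

A positive gap on this subset, as reported for factual tasks, is the signal that the self model's representations carry information a peer cannot recover; a gap near zero, as in math reasoning, means correctness is externally predictable.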

[Figure: premium_gap_lr]

[Figure: heatmap_gap_lr]

Accepted at ACL 2026 (Main Conference)
Authors: Tomer Ashuach, Liat Ein-Dor, Shai Gretz, Yoav Katz, and Yonatan Belinkov


Get this paper in your agent:

hf papers read 2604.12373
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
