Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness
Abstract
Large language models show no overall advantage over external models in predicting their own answer correctness, but they do hold a domain-specific advantage on factual-knowledge questions when the models' predictions disagree.
Humans use introspection to evaluate their understanding through private internal states inaccessible to external observers. We investigate whether large language models possess similar privileged knowledge about answer correctness, information unavailable through external observation. We train correctness classifiers on question representations from both a model's own hidden states and external models, testing whether self-representations provide a performance advantage. On standard evaluation, we find no advantage: self-probes perform comparably to peer-model probes. We hypothesize this is due to high inter-model agreement on answer correctness. To isolate genuine privileged knowledge, we evaluate on disagreement subsets, where models produce conflicting predictions. Here, we discover domain-specific privileged knowledge: self-representations consistently outperform peer representations in factual knowledge tasks, but show no advantage in math reasoning. We further localize this domain asymmetry across model layers, finding that the factual advantage emerges progressively from early-to-mid layers onward, consistent with model-specific memory retrieval, while math reasoning shows no consistent advantage at any depth.
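The probing setup in the abstract can be sketched as follows. This is a toy reconstruction, not the authors' code: the "activations" are synthetic, and every dimension, hyperparameter, and the gradient-descent optimizer are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for layer activations: in the paper's setup these would be
# hidden states over the question tokens, taken either from the answering
# model ("self") or from a different model ("peer"). Sizes are arbitrary.
n, d = 600, 32
H = rng.normal(size=(n, d))                  # question representations
w_true = rng.normal(size=d)                  # hidden "difficulty" direction
y = (H @ w_true + 0.3 * rng.normal(size=n) > 0).astype(float)  # 1 = correct

# Linear probe: logistic regression trained with plain gradient descent
# on the binary correctness labels.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(H @ w + b)))   # predicted P(correct)
    g = p - y                                # logistic-loss gradient
    w -= lr * (H.T @ g) / n
    b -= lr * g.mean()

pred = (1.0 / (1.0 + np.exp(-(H @ w + b))) > 0.5).astype(float)
acc = (pred == y).mean()
print(f"probe accuracy: {acc:.2f}")
```

In the paper's comparison, the same probe architecture would be trained once on self-representations and once on peer-model representations, and the two accuracies compared.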
TL;DR: Do LLMs have internal signals about whether they'll answer correctly — signals that external models can't access? We find the answer depends on the domain: yes for factual knowledge, no for mathematical reasoning.
Key Findings:
- Inter-model agreement masks privileged knowledge. Self-probes appear no better than cross-model probes on standard evaluations — but this is because models agree on ~80% of questions, letting external probes piggyback on shared difficulty patterns.
- Disagreement subsets reveal genuine self-knowledge. When we restrict evaluation to questions where models disagree, self-representations consistently outperform peer representations on factual tasks (~5% premium gap, statistically significant across all configurations).
- Math correctness remains publicly observable. In mathematical reasoning, no premium gap emerges at any layer depth — suggesting correctness is governed by problem structure rather than model-specific retrieval.
- The signal builds through depth. Layer-wise analysis shows the factual advantage emerges progressively from early-to-mid layers onward, consistent with idiosyncratic memory retrieval accumulating through the forward pass.
Accepted at ACL 2026 (Main Conference)
Authors: Tomer Ashuach, Liat Ein-Dor, Shai Gretz, Yoav Katz, and Yonatan Belinkov