---
library_name: confidence-cartography
tags:
  - confidence-cartography
  - interpretability
  - causal-lm
  - confidence-calibration
  - mandela-effect
  - false-belief-detection
  - teacher-forcing
  - rho-eval
  - alignment
  - rho-guided-sft
  - contrastive-loss
  - calibration-repair
  - behavioral-audit
  - steering-vectors
  - mechanistic-interpretability
  - fidelity-bench
  - pythia
  - llama
  - mistral
  - qwen
  - gpt2
  - pytorch
  - transformers
  - en
license: mit
language:
  - en
model-index:
  - name: Confidence Cartography
    results:
      - task:
          type: text-classification
          name: False-Belief Detection (Mandela Effect)
        dataset:
          name: YouGov Mandela Effect Survey (2022)
          type: custom
        metrics:
          - type: spearman_correlation
            value: 0.652
            name: Spearman Correlation (human false-belief prevalence)
            verified: false
          - type: p_value
            value: 0.016
            name: p-value
            verified: false
          - type: accuracy
            value: 0.88
            name: Accuracy
            verified: false
---

Confidence Cartography

Teacher-forced confidence analysis for causal language models.

Measures P(actual_token | preceding_context) at every position in a text using a single forward pass. The resulting per-token probability map detects false beliefs, Mandela effects, and medical misinformation by comparing confidence on true vs. false versions of claims.

Paper: DOI 10.5281/zenodo.18703506

Toolkit: confidence-cartography-toolkit (pip-installable)

Paper repo: confidence-cartography (experiments + manuscript)

Key Results

  • Mandela Effect calibration: Spearman rho = 0.652 (p = 0.016) — model confidence ratios track human false-belief prevalence across Pythia 160M to 12B
  • Medical misinformation detection: 88% accuracy at Pythia 6.9B scale (p = 0.01 vs random)
  • Scaling: Detection improves monotonically with model scale across the Pythia family
  • Targeted resampling: 3-5x cheaper than uniform best-of-N by regenerating only from low-confidence positions
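The targeted-resampling idea above can be sketched as follows (an illustrative sketch, not the toolkit's implementation; the function name and defaults are assumptions): score the text once with teacher forcing, find the lowest-confidence position, and regenerate only from there instead of resampling the whole sequence N times.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def targeted_resample(text, model_name="gpt2", max_new_tokens=20):
    """Keep the confident prefix; resample from the weakest position only."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids

    # One teacher-forced pass: P(actual token | preceding context) at each position
    with torch.no_grad():
        logits = model(ids).logits
    lp = torch.log_softmax(logits[0, :-1], dim=-1)
    probs = lp[torch.arange(ids.size(1) - 1), ids[0, 1:]].exp()

    # Position of the token the model found most surprising
    weakest = int(probs.argmin()) + 1
    prefix = ids[:, :weakest]

    out = model.generate(prefix, do_sample=True, max_new_tokens=max_new_tokens,
                         pad_token_id=tok.eos_token_id)
    return tok.decode(out[0], skip_special_tokens=True)
```

Because only the suffix after the weakest position is regenerated, each retry reuses the accepted prefix, which is where the claimed savings over uniform best-of-N come from.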

Install

pip install git+https://github.com/SolomonB14D3/confidence-cartography-toolkit.git

Reproduce the Results in 3 Lines

import confidence_cartography as cc

# Mandela Effect calibration (9 items, YouGov ground truth)
results = cc.evaluate_mandela_effect("EleutherAI/pythia-6.9b")
print(f"Spearman rho: {results.rho:.3f}, p={results.p_value:.4f}")
# -> rho=0.652, p=0.016

# Medical myth detection (25 curated pairs)
results = cc.evaluate_medical_myths("EleutherAI/pythia-6.9b")
print(f"Accuracy: {results.accuracy:.0%}")
# -> 88%

Works With Any Causal LM

Tested on:

Architecture   Models Tested
Pythia         160M, 410M, 1B, 1.4B, 2.8B, 6.9B, 12B
GPT-2          124M
Qwen 2.5       0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
Llama 3.1      8B
Mistral        7B

Any HuggingFace AutoModelForCausalLM-compatible model works:

from confidence_cartography import ConfidenceScorer

scorer = ConfidenceScorer("your-model-here")
result = scorer.score("The capital of France is Paris.")
print(f"Mean confidence: {result.mean_top1_prob:.1%}")

How It Works

  1. Tokenize the input text
  2. Run a single forward pass through the causal LM
  3. At each position t, record P(token[t+1] | tokens[0:t]) — the probability the model assigns to the actual next token
  4. Compute summary statistics: mean confidence, entropy, minimum-confidence position
  5. Check against 15 Mandela Effect patterns and 15 medical myth patterns
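Steps 1-4 can be sketched with plain transformers (an illustrative sketch, not the toolkit's internal code; the function and key names here are assumptions):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_confidences(text, model_name="gpt2"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    ids = tok(text, return_tensors="pt").input_ids   # step 1: tokenize
    with torch.no_grad():
        logits = model(ids).logits                   # step 2: single forward pass

    # step 3: logits at position t predict ids[t+1]; gather P(actual next token)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    actual = ids[0, 1:]
    token_lp = log_probs[torch.arange(actual.numel()), actual]

    # step 4: summary statistics (entropy converted from nats to bits)
    probs = token_lp.exp()
    entropy_bits = -(log_probs.exp() * log_probs).sum(dim=-1) / math.log(2)
    return {
        "mean_actual_prob": probs.mean().item(),
        "mean_entropy_bits": entropy_bits.mean().item(),
        "min_confidence_pos": int(probs.argmin()),
    }
```

The whole map costs one forward pass regardless of text length, which is what makes scoring large corpora tractable.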

This is teacher forcing — we feed the actual text and measure how surprised the model is at each token, rather than letting the model generate freely. The key insight is that false beliefs that are widely shared in the training data (like "Luke, I am your father") receive higher confidence than the correct versions, and this effect correlates with human survey data.
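That comparison can be made concrete with plain transformers (an illustrative sketch, not the toolkit's API): score both variants under teacher forcing and take the ratio of their mean per-token probabilities.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_logprob(text):
    """Mean per-token log-probability of `text` under teacher forcing."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    lp = torch.log_softmax(logits[0, :-1], dim=-1)
    return lp[torch.arange(ids.size(1) - 1), ids[0, 1:]].mean().item()

# A ratio > 1 would mean the model is more confident in the false (Mandela) variant
ratio = math.exp(mean_logprob("The Berenstein Bears")
                 - mean_logprob("The Berenstain Bears"))
print(f"false/true per-token confidence ratio: {ratio:.2f}")
```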

API

from confidence_cartography import ConfidenceScorer

scorer = ConfidenceScorer("gpt2")

# Score any text
result = scorer.score("Vaccines cause autism.")
print(result.mean_top1_prob)   # overall confidence
print(result.mean_entropy)     # uncertainty (bits)
print(result.min_confidence_token)  # weakest token

# Score + flag known false beliefs
record, flags = scorer.score_and_flag("The Berenstein Bears")
# flags: ['mandela_match:berenstain']

# Compare two versions
result = scorer.compare("Berenstain Bears", "Berenstein Bears")
print(result["confidence_ratio"])

# Reproduce paper benchmarks
import confidence_cartography as cc
mandela = cc.evaluate_mandela_effect("EleutherAI/pythia-6.9b")
medical = cc.evaluate_medical_myths("EleutherAI/pythia-6.9b")

Citation

@software{sanchez2026confidence,
  author = {Sanchez, Bryan},
  title = {Confidence Cartography: Using Language Models as Sensors for the Structure of Human Knowledge},
  year = {2026},
  doi = {10.5281/zenodo.18703506},
  url = {https://github.com/SolomonB14D3/confidence-cartography}
}

License

MIT