---
library_name: confidence-cartography
tags:
  - confidence-cartography
  - interpretability
  - causal-lm
  - confidence-calibration
  - mandela-effect
  - false-belief-detection
  - teacher-forcing
  - rho-eval
  - alignment
  - rho-guided-sft
  - contrastive-loss
  - calibration-repair
  - behavioral-audit
  - steering-vectors
  - mechanistic-interpretability
  - fidelity-bench
  - pythia
  - llama
  - mistral
  - qwen
  - gpt2
  - pytorch
  - transformers
  - en
license: mit
language:
  - en
model-index:
  - name: Confidence Cartography
    results:
      - task:
          type: text-classification
          name: False-Belief Detection (Mandela Effect)
        dataset:
          name: YouGov Mandela Effect Survey (2022)
          type: custom
        metrics:
          - type: spearman_correlation
            value: 0.652
            name: Spearman Correlation (human false-belief prevalence)
            verified: false
          - type: p_value
            value: 0.016
            name: p-value
            verified: false
          - type: accuracy
            value: 0.88
            name: Accuracy
            verified: false
---

Confidence Cartography

Teacher-forced confidence analysis for causal language models.

Measures P(actual_token | preceding_context) at every position in a text using a single forward pass. The resulting per-token probability map detects false beliefs, Mandela effects, and medical misinformation by comparing confidence on true vs. false versions of claims.

Paper: DOI 10.5281/zenodo.18703506

Toolkit: confidence-cartography-toolkit (pip-installable)

Paper repo: confidence-cartography (experiments + manuscript)

Key Results

  • Mandela Effect calibration: Spearman rho = 0.652 (p = 0.016) — model confidence ratios track human false-belief prevalence across Pythia 160M to 12B
  • Medical misinformation detection: 88% accuracy at Pythia 6.9B scale (p = 0.01 vs random)
  • Scaling: Detection improves monotonically with model scale across the Pythia family
  • Targeted resampling: 3-5x cheaper than uniform best-of-N by regenerating only from low-confidence positions
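The targeted-resampling idea above can be sketched as follows (an illustrative sketch, not the toolkit's implementation; the function name and defaults are assumptions): score the text once with teacher forcing, find the lowest-confidence position, and regenerate only from there instead of resampling the whole sequence N times.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def targeted_resample(text, model_name="gpt2", max_new_tokens=20):
    """Keep the confident prefix; resample from the weakest position only."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids

    # One teacher-forced pass: P(actual token | preceding context) at each position
    with torch.no_grad():
        logits = model(ids).logits
    lp = torch.log_softmax(logits[0, :-1], dim=-1)
    probs = lp[torch.arange(ids.size(1) - 1), ids[0, 1:]].exp()

    # Position of the token the model found most surprising
    weakest = int(probs.argmin()) + 1
    prefix = ids[:, :weakest]

    out = model.generate(prefix, do_sample=True, max_new_tokens=max_new_tokens,
                         pad_token_id=tok.eos_token_id)
    return tok.decode(out[0], skip_special_tokens=True)
```

Because only the suffix after the weakest position is regenerated, each retry reuses the accepted prefix, which is where the claimed savings over uniform best-of-N come from.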

Install

pip install git+https://github.com/SolomonB14D3/confidence-cartography-toolkit.git

Reproduce the Results in 3 Lines

import confidence_cartography as cc

# Mandela Effect calibration (9 items, YouGov ground truth)
results = cc.evaluate_mandela_effect("EleutherAI/pythia-6.9b")
print(f"Spearman rho: {results.rho:.3f}, p={results.p_value:.4f}")
# -> rho=0.652, p=0.016

# Medical myth detection (25 curated pairs)
results = cc.evaluate_medical_myths("EleutherAI/pythia-6.9b")
print(f"Accuracy: {results.accuracy:.0%}")
# -> 88%

Works With Any Causal LM

Tested on:

Architecture   Models Tested
Pythia         160M, 410M, 1B, 1.4B, 2.8B, 6.9B, 12B
GPT-2          124M
Qwen 2.5       0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
Llama 3.1      8B
Mistral        7B

Any HuggingFace AutoModelForCausalLM-compatible model works:

from confidence_cartography import ConfidenceScorer

scorer = ConfidenceScorer("your-model-here")
result = scorer.score("The capital of France is Paris.")
print(f"Mean confidence: {result.mean_top1_prob:.1%}")

How It Works

  1. Tokenize the input text
  2. Run a single forward pass through the causal LM
  3. At each position t, record P(token[t+1] | tokens[0:t]) — the probability the model assigns to the actual next token
  4. Compute summary statistics: mean confidence, entropy, minimum-confidence position
  5. Check against 15 Mandela Effect patterns and 15 medical myth patterns
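Steps 1-4 can be sketched with plain transformers (an illustrative sketch, not the toolkit's internal code; the function and key names here are assumptions):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_confidences(text, model_name="gpt2"):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()

    ids = tok(text, return_tensors="pt").input_ids   # step 1: tokenize
    with torch.no_grad():
        logits = model(ids).logits                   # step 2: single forward pass

    # step 3: logits at position t predict ids[t+1]; gather P(actual next token)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    actual = ids[0, 1:]
    token_lp = log_probs[torch.arange(actual.numel()), actual]

    # step 4: summary statistics (entropy converted from nats to bits)
    probs = token_lp.exp()
    entropy_bits = -(log_probs.exp() * log_probs).sum(dim=-1) / math.log(2)
    return {
        "mean_actual_prob": probs.mean().item(),
        "mean_entropy_bits": entropy_bits.mean().item(),
        "min_confidence_pos": int(probs.argmin()),
    }
```

The whole map costs one forward pass regardless of text length, which is what makes scoring large corpora tractable.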

This is teacher forcing — we feed the actual text and measure how surprised the model is at each token, rather than letting the model generate freely. The key insight is that false beliefs that are widely shared in the training data (like "Luke, I am your father") receive higher confidence than the correct versions, and this effect correlates with human survey data.
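That comparison can be made concrete with plain transformers (an illustrative sketch, not the toolkit's API): score both variants under teacher forcing and take the ratio of their mean per-token probabilities.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_logprob(text):
    """Mean per-token log-probability of `text` under teacher forcing."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    lp = torch.log_softmax(logits[0, :-1], dim=-1)
    return lp[torch.arange(ids.size(1) - 1), ids[0, 1:]].mean().item()

# A ratio > 1 would mean the model is more confident in the false (Mandela) variant
ratio = math.exp(mean_logprob("The Berenstein Bears")
                 - mean_logprob("The Berenstain Bears"))
print(f"false/true per-token confidence ratio: {ratio:.2f}")
```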

API

from confidence_cartography import ConfidenceScorer

scorer = ConfidenceScorer("gpt2")

# Score any text
result = scorer.score("Vaccines cause autism.")
print(result.mean_top1_prob)   # overall confidence
print(result.mean_entropy)     # uncertainty (bits)
print(result.min_confidence_token)  # weakest token

# Score + flag known false beliefs
record, flags = scorer.score_and_flag("The Berenstein Bears")
# flags: ['mandela_match:berenstain']

# Compare two versions
result = scorer.compare("Berenstain Bears", "Berenstein Bears")
print(result["confidence_ratio"])

# Reproduce paper benchmarks
import confidence_cartography as cc
mandela = cc.evaluate_mandela_effect("EleutherAI/pythia-6.9b")
medical = cc.evaluate_medical_myths("EleutherAI/pythia-6.9b")

Citation

@software{sanchez2026confidence,
  author = {Sanchez, Bryan},
  title = {Confidence Cartography: Using Language Models as Sensors for the Structure of Human Knowledge},
  year = {2026},
  doi = {10.5281/zenodo.18703506},
  url = {https://github.com/SolomonB14D3/confidence-cartography}
}

License

MIT