
PHASE 1 RESEARCH ARTIFACT: detects model confidence, not factual correctness

This model was trained on LLM hidden-state activations to produce an energy score that correlates with the model's output confidence (hallucination likelihood). It cannot verify whether a model's answer is factually correct; it can only signal how uncertain the model appears token by token.

This limitation was confirmed in Exp 184/203: the energy scores reflect model confidence, not answer correctness. Do not use these scores as a correctness verifier.

For production use, install the full Carnot pipeline:

```
pip install carnot
```

The production pipeline includes FormalClaimVerifier (solver-routed formal claim verification), PBT code verification, process integrity detection (right-for-wrong-reasons), constraint verification, guided decoding, energy-based repair, and the Carnot MCP server. See Carnot on GitHub for documentation.

Exp 316 Full-Scale Benchmark Results (2026-04-14)

The Carnot FCV pipeline was benchmarked on 400 GSM8K questions (adversarial corpus with number_swap and irrelevant_sentence perturbations) and 50 HumanEval problems.

Baseline accuracy on adversarial GSM8K (no Carnot intervention):

| Model | GSM8K Accuracy | 95% CI | N |
|---|---|---|---|
| Gemma4-E4B-it | 26.3% | [22.2%, 30.8%] | 400 |
| Qwen3.5-0.8B | 27.5% | [23.4%, 32.1%] | 400 |

Note: inference_mode=simulated. Live GPU results pending. See results/experiment_316_fullscale_results.json for full details.
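The reported confidence intervals are consistent with Wilson score intervals on the binomial accuracy. A minimal sketch, assuming the success counts (105/400 and 110/400) implied by the rounded percentages:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return center - margin, center + margin

# 26.3% accuracy on 400 questions is ~105 correct answers
lo, hi = wilson_ci(105, 400)
print(f"[{lo:.1%}, {hi:.1%}]")  # -> [22.2%, 30.8%]
```

The Wilson interval is preferred over the normal approximation at these sample sizes because it never extends outside [0, 1] and stays accurate for proportions far from 50%.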


tags:
  - energy-based-model
  - hallucination-detection
  - jax
  - carnot
license: apache-2.0


Source and documentation: https://github.com/ianblenke/carnot

Important: Research Artifact, Not a Production Detector

This model achieves 68.3% on held-out TruthfulQA test sets, but in practical deployment (8 real questions), activation-based EBMs agreed with ground truth only 50% of the time. The EBM detects model confidence, not correctness: confident hallucinations get low energy (look fine) while correct-but-hedging answers get flagged.

This model is a research artifact documenting activation-space structure. It is NOT a reliable hallucination detector for production use.

For practical verification, use structural constraints (test execution, SAT solving) rather than activation analysis. See the Carnot technical report for 41 experiments and 14 principles learned.
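The structural-constraint alternative can be illustrated without any activation analysis: execute candidate code against randomized property checks. This is a hand-rolled sketch of the idea (the Carnot pipeline itself uses Hypothesis for property-based testing; `claimed_sort` here is a hypothetical stand-in for model-generated code):

```python
import random
from collections import Counter

def claimed_sort(xs):
    # Stand-in for model-generated code under verification (hypothetical example).
    return sorted(xs)

def verify_by_execution(fn, trials=200, seed=0):
    # Structural check: run the code on random inputs and test properties,
    # instead of inspecting the model's hidden-state activations.
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        out = fn(list(xs))
        ordered = all(a <= b for a, b in zip(out, out[1:]))
        same_multiset = Counter(out) == Counter(xs)  # no elements added or dropped
        if not (ordered and same_multiset):
            return False
    return True

print(verify_by_execution(claimed_sort))  # True for a correct implementation
```

Unlike an energy score, a failed property check is a hard counterexample: the verdict does not depend on how confident the generating model appeared.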

per-token-ebm-qwen35-08b-think

Per-token hallucination detection EBM trained on Qwen3.5-0.8B activations with thinking (chain-of-thought) enabled. Lower accuracy (67.2%) because thinking compresses the hallucination signal (Principle 10).

Key Stats

| Metric | Value |
|---|---|
| Test accuracy | 68.3% |
| Energy gap | 0.7477 |
| Source model | Qwen/Qwen3.5-0.8B |
| Thinking mode | enabled |
| Training tokens | 52,296 |
| Architecture | Gibbs [1024 → 256 → 64 → 1], SiLU |
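The architecture row describes a small MLP mapping a 1024-dim activation vector to a scalar energy. A minimal JAX sketch with random, untrained weights (illustrative only, not the published checkpoint or the Carnot `GibbsModel` implementation):

```python
import jax
import jax.numpy as jnp

def init_mlp(key, dims=(1024, 256, 64, 1)):
    # One (weight, bias) pair per linear layer; scaled Gaussian init (illustrative).
    params = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        key, sub = jax.random.split(key)
        params.append((jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in),
                       jnp.zeros(d_out)))
    return params

def energy(params, x):
    # SiLU on the two hidden layers, linear scalar output.
    for w, b in params[:-1]:
        x = jax.nn.silu(x @ w + b)
    w, b = params[-1]
    return (x @ w + b)[0]

params = init_mlp(jax.random.PRNGKey(0))
e = energy(params, jnp.ones(1024))  # scalar energy for one activation vector
```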

Usage

```python
from safetensors.numpy import load_file
import jax.numpy as jnp
import jax.random as jrandom
from carnot.models.gibbs import GibbsConfig, GibbsModel

# Load weights
weights = load_file("per-token-ebm-qwen35-08b-think/model.safetensors")
config = GibbsConfig(input_dim=1024, hidden_dims=[256, 64], activation="silu")
ebm = GibbsModel(config, key=jrandom.PRNGKey(0))

# Set weights
ebm.layers = [
    (jnp.array(weights["layer_0_weight"]), jnp.array(weights["layer_0_bias"])),
    (jnp.array(weights["layer_1_weight"]), jnp.array(weights["layer_1_bias"])),
]
ebm.output_weight = jnp.array(weights["output_weight"])
ebm.output_bias = jnp.array(weights["output_bias"])

# Score an activation vector: a 1024-dim hidden-state vector
# extracted from Qwen/Qwen3.5-0.8B during generation
energy = float(ebm.energy(activation_vector))
# Low energy = likely correct, high energy = likely hallucination
```

Training

  • Loss: Noise Contrastive Estimation (NCE)
  • Epochs: 300, lr=0.005
  • Data: 52,296 per-token activations from Qwen/Qwen3.5-0.8B
  • Labels: correct answer tokens = low energy (data), wrong answer tokens = high energy (noise)
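The NCE objective above amounts to a binary classification between data tokens (correct answers, pushed to low energy) and noise tokens (wrong answers, pushed to high energy), using the negated energy as the logit. A simplified toy sketch with a linear energy, not the Carnot training code:

```python
import jax
import jax.numpy as jnp

def nce_loss(w, data_batch, noise_batch):
    # Simplified binary NCE: data labeled 1, noise labeled 0, logit = -energy.
    # Minimizing pushes data energies down and noise energies up.
    e_data = data_batch @ w    # toy linear energy; the real model is an MLP
    e_noise = noise_batch @ w
    return -(jnp.mean(jax.nn.log_sigmoid(-e_data)) +
             jnp.mean(jax.nn.log_sigmoid(e_noise)))

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
data = jax.random.normal(k1, (32, 8)) + 1.0   # stand-in "correct-token" activations
noise = jax.random.normal(k2, (32, 8)) - 1.0  # stand-in "wrong-token" activations
w = jnp.zeros(8)
loss_before = nce_loss(w, data, noise)
w = w - 0.5 * jax.grad(nce_loss)(w, data, noise)  # one SGD step
loss_after = nce_loss(w, data, noise)             # lower than loss_before
```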

Limitations

  • Only works with activations from Qwen/Qwen3.5-0.8B; different models have different representation spaces
  • 68.3% accuracy: use it as one signal among many, not as sole verification
  • Trained on QA/TruthfulQA; may not generalize to all domains

10 Principles from Carnot Research

  1. Simpler is better in small-data regimes
  2. Token-level features > sequence-level
  3. The model's own logprobs are the best energy
  4. Overfitting is the main enemy
  5. Extract features from generated tokens, not prompts
  6. Different energy signals dominate in different domains
  7. Statistical difference ≠ causal influence
  8. Instruction tuning compresses the hallucination signal
  9. Adversarial questions defeat post-hoc detection
  10. Chain-of-thought compresses the hallucination signal

See the Carnot technical report for all 25 experiments.

What's Proven to Work (2026)

The following Carnot pipeline capabilities have been validated with live GPU inference (not simulation) as of April 2026. Install via pip install carnot.

| Capability | What it does | Evidence |
|---|---|---|
| FormalClaimVerifier | Solver-routed formal claim verification: arithmetic, boolean-entailment, set-membership, execution-oracle, cardinality, and comparison routes | 1,243 solver-routable rows from live GSM8K + HumanEval traces (Exp 244/246) |
| PBT code verification | Property-based testing (Hypothesis) catches bugs that official test suites miss | +3.0pp on 164-problem HumanEval with Gemma4-E4B-it (Exp 226); 2 official-test misses caught on Qwen3.5-0.8B (Exp 227) |
| Process integrity detection | Detects right-for-wrong-reasons answers where the output is correct but the reasoning process is invalid | 5 right-for-wrong-reasons cases caught across a 30-case HumanEval cohort (Exp 251) |
| Carnot MCP server | Exposes verify_code_with_pbt and 6 other tools to any MCP-compatible agent | 7 discoverable tools, 30s timeout, 10K input guard (VERIFY-031) |

These results use instruction-tuned models (Gemma4-E4B-it, Qwen3.5-0.8B) on live CUDA hardware. All per-token EBM confidence results (this model family) are Phase 1 research artifacts and should not be interpreted as correctness scores.
