---
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
language:
- en
base_model: answerdotai/ModernBERT-base
tags:
- rag
- governance
- hallucination-detection
- epistemic-honesty
- classification
- fitz-gov
- pyrrho
datasets:
- yafitzdev/fitz-gov
metrics:
- accuracy
- f1
- false-trustworthy-rate
---
# pyrrho-modernbert-base-v1
> Decide whether your retrieved sources support a confident answer, contradict each other, or simply don't contain it — **without an LLM call**.
This is a fine-tune of [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base) on [fitz-gov](https://github.com/yafitzdev/fitz-gov) V5.1 for **3-class RAG governance classification**: given a `(query, retrieved contexts)` pair, predicts one of:
| Verdict | Meaning |
|---|---|
| `ABSTAIN` | The sources do not contain enough information to answer. |
| `DISPUTED` | The sources contradict each other on the answer. |
| `TRUSTWORTHY` | The sources consistently and sufficiently support an answer. |
A drop-in replacement for the constraint+sklearn governance pipeline in [fitz-sage](https://github.com/yafitzdev/fitz-sage). Single forward pass, ~30 ms on CPU after INT8 ONNX quantization, no external LLM dependency.
---
## Results
Validated on the [fitz-gov](https://github.com/yafitzdev/fitz-gov) V5.1 eval split (584 cases, stratified 20% hold-out from `tier1_core`). All numbers are **3-seed mean ± std** across seeds [42, 1337, 7].
| Metric | pyrrho v1 | fitz-sage v0.11 (sklearn baseline) | Δ |
|---|---|---|---|
| Overall accuracy (calibrated) | **86.13 ± 0.86** | 78.7 | **+7.43** |
| False-trustworthy rate (safety) | **5.27 ± 0.21** | 5.7 | **-0.43** (safer) |
| Trustworthy recall | **79.38 ± 1.64** | 70.0 | **+9.38** |
| Disputed recall | **94.81 ± 1.28** | 86.1 | **+8.71** |
| Abstain recall | **92.94 ± 1.11** | 86.5 | **+6.44** |
| Macro F1 | 86.10 ± 0.80 | n/a | — |
---
## Known limitations
1. **Multi-source-convergence cases can be misclassified as DISPUTED.** When multiple authoritative sources state the same fact with slight numerical variation that falls within measurement tolerance (e.g., 4 climate agencies citing 1.09–1.20 °C of warming, or NIST and IUPAC both giving the speed of light), the model occasionally classifies the case as DISPUTED with high confidence. On the relevant fitz-gov subcategory (`multi_source_convergence`, n=7) the error rate is ~57%. A v2 release with augmented training data targeting this pattern is planned.
2. **Short, direct factual contexts can trigger over-abstention.** Smoke-test example: query *"When was the iPhone released?"* + a single-sentence context confirming June 29, 2007 → predicted `ABSTAIN` with P(ABSTAIN)=0.92. The model was trained on 62.7% hard tier1 cases (rich methodological contexts), so it underweights the short-clean-answer pattern. Production RAG chunks (typically 200–500 chars) are tier1-like and largely unaffected.
---
## Usage
### Direct (transformers)
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = AutoModelForSequenceClassification.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1").eval()

query = "Has the company achieved profitability?"
contexts = [
    "The company posted its first profitable quarter, with net income of $4 million.",
    "The company recorded a quarterly loss of $12 million, the third consecutive losing quarter.",
]

# Build the input the same way training data was formatted
text = f"Question: {query}\n\nSources:\n" + "\n".join(
    f"[{i}] {c}" for i, c in enumerate(contexts, start=1)
)

enc = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits[0]
probs = torch.softmax(logits, dim=-1).numpy()

labels = ["ABSTAIN", "DISPUTED", "TRUSTWORTHY"]
print(f"Predicted: {labels[int(probs.argmax())]}")
print(f"Probs    : A={probs[0]:.3f} D={probs[1]:.3f} T={probs[2]:.3f}")
```
### CPU-optimized (ONNX + INT8)
For production CPU inference at ~30 ms / case, load the INT8 ONNX variant via `optimum`:
```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yafitzdev/pyrrho-modernbert-base-v1")
model = ORTModelForSequenceClassification.from_pretrained(
    "yafitzdev/pyrrho-modernbert-base-v1",
    file_name="model_quantized.onnx",
)
# Same input format as above...
```
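To keep client code in sync with the training-time format, the prompt construction from the transformers example can be factored into a small helper (`build_governance_input` is an illustrative name, not part of the released package):

```python
def build_governance_input(query: str, contexts: list[str]) -> str:
    """Format a (query, contexts) pair the same way the training data was:
    a 'Question:' line followed by numbered 'Sources:' entries."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return f"Question: {query}\n\nSources:\n{numbered}"
```

Tokenize the returned string with `truncation=True, max_length=4096` exactly as in the transformers example above.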
### Calibrated decision rule
The headline numbers above use **threshold calibration** on the TRUSTWORTHY softmax probability. To match the published numbers, fall back from `TRUSTWORTHY` to the runner-up class when `P(TRUSTWORTHY) < tau`. The per-seed selected `tau` varied across runs (0.34–0.62); the safest default is `tau = 0.50`.
```python
TAU = 0.50

pred = int(probs.argmax())
if pred == 2 and probs[2] < TAU:    # TRUSTWORTHY id is 2
    pred = int(probs[:2].argmax())  # fall back to runner-up between ABSTAIN/DISPUTED
```
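Wrapped as a reusable function over the 3-probability vector, the same rule looks like this (`calibrated_verdict` is an illustrative helper, not part of the model repo):

```python
LABELS = ["ABSTAIN", "DISPUTED", "TRUSTWORTHY"]

def calibrated_verdict(probs, tau=0.50):
    """Argmax verdict, but demote a weak TRUSTWORTHY (index 2) to the
    runner-up between ABSTAIN and DISPUTED when P(TRUSTWORTHY) < tau."""
    pred = max(range(3), key=lambda i: probs[i])
    if pred == 2 and probs[2] < tau:
        pred = 0 if probs[0] >= probs[1] else 1
    return LABELS[pred]
```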
---
## Training
| Hyperparameter | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-base` |
| Architecture | ModernBERT (sequence classification head) |
| Labels (3-class) | ABSTAIN (0), DISPUTED (1), TRUSTWORTHY (2) |
| Max sequence length | 4096 tokens |
| Epochs | 5 (with early stopping, patience 2) |
| Per-device batch size | 16 |
| Effective batch size | 16 |
| Learning rate | 5e-5 |
| LR scheduler | cosine, 10% warmup |
| Weight decay | 0.01 |
| Label smoothing | 0.15 |
| Class weights | [2.3, 2.3, 1.0] (counters TRUSTWORTHY-over-prediction from 53% class imbalance) |
| Loss | Weighted cross-entropy + label smoothing |
| Selection metric | `ft_penalized_accuracy = accuracy - 3 * max(0, FT - 0.057)` |
| Optimizer | adamw_torch_fused (bf16) |
| Hardware | NVIDIA RTX 5090 (Blackwell sm_120) |
| Training time | ~80–500 s per run depending on GPU contention |
Training data: fitz-gov V5.1 `tier1_core`, stratified 80/20 split by `(label, difficulty)` for train/eval. The 60-case `tier0_sanity` set is held out separately as a noise-prone diagnostic.
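The `ft_penalized_accuracy` selection metric in the table trades accuracy against the false-trustworthy (FT) rate once FT exceeds the 5.7% baseline; a minimal sketch, with both quantities as fractions rather than percentages:

```python
def ft_penalized_accuracy(accuracy: float, ft_rate: float, baseline: float = 0.057) -> float:
    """Accuracy minus a 3x penalty on any false-trustworthy rate above the baseline."""
    return accuracy - 3.0 * max(0.0, ft_rate - baseline)
```

For example, a run at 86% accuracy with a 6.7% FT rate scores 0.86 - 3 × 0.01 = 0.83, while any run at or below the 5.7% baseline is scored on accuracy alone.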
---
## Dataset
This model is trained and evaluated on [**fitz-gov V5.1**](https://github.com/yafitzdev/fitz-gov), a 2,980-case benchmark for RAG governance (epistemic honesty). The eval split (584 cases) is a stratified 20% hold-out from `tier1_core` (2,920 cases, 62.7% hard difficulty, 17 domains, 113+ subcategories).
fitz-gov commit at training time: `3e1d22e22fdff726330a0d70503b07f73dacf817`
---
## Limitations & intended use
**Intended use:** as a CPU-friendly governance head inside a RAG pipeline that needs to decide when to answer, abstain, or flag a dispute. Drop-in replacement for the constraint+sklearn cascade in [fitz-sage](https://github.com/yafitzdev/fitz-sage).
**Not intended for:**
- Generating answers (this is a classification model, not a generator).
- Token-level hallucination localization (see [LettuceDetect](https://github.com/KRLabsOrg/LettuceDetect) for that — complementary use).
- Languages other than English. fitz-gov is English-only; multilingual variants are a v3+ consideration.
**Safety axis:** the false-trustworthy rate is the production safety metric (a case wrongly classified as `TRUSTWORTHY` is the dangerous error — the system would confidently surface a hallucinated or unsupported answer). Threshold calibration is tuned to keep this rate at or below the fitz-sage baseline (5.7%).
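For monitoring this axis in production, the false-trustworthy rate can be computed from gold and predicted label ids. One reading of the metric is assumed here: the share of all cases whose gold label is not `TRUSTWORTHY` but whose prediction is.

```python
def false_trustworthy_rate(gold: list[int], pred: list[int], trustworthy_id: int = 2) -> float:
    """Fraction of cases where a non-TRUSTWORTHY gold label was predicted TRUSTWORTHY."""
    ft = sum(1 for g, p in zip(gold, pred) if g != trustworthy_id and p == trustworthy_id)
    return ft / len(gold)
```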
---
## Citation
```bibtex
@misc{pyrrho_v1_2026,
  title  = {pyrrho-modernbert-base-v1},
  author = {Yan Fitzner},
  year   = {2026},
  url    = {https://huggingface.co/yafitzdev/pyrrho-modernbert-base-v1},
}
```
## License
Apache 2.0 — see [LICENSE](https://github.com/yafitzdev/pyrrho/blob/main/LICENSE).
## Related projects
- [**fitz-sage**](https://github.com/yafitzdev/fitz-sage) — production RAG library that uses this model.
- [**fitz-gov**](https://github.com/yafitzdev/fitz-gov) — the benchmark dataset.
- [**pyrrho**](https://github.com/yafitzdev/pyrrho) — training code and roadmap for the full model family.