---
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- hallucination-detection
- token-classification
- qwen3
language:
- en
---
# TokenHD-4B
**TokenHD-4B** is a token-level hallucination detector trained on top of [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) with the TokenHD pipeline. It assigns a hallucination probability to every token of an LLM-generated response, so errors can be localized at token granularity without predefined step segmentation.
- Paper: [arxiv.org/abs/2605.12384](https://arxiv.org/abs/2605.12384)
- Code: [github.com/rmin2000/TokenHD](https://github.com/rmin2000/TokenHD)
- Training Data: [mr233/TokenHD-training-data](https://huggingface.co/datasets/mr233/TokenHD-training-data)
---
## Model Details
| Property | Value |
|---|---|
| Base model | `Qwen/Qwen3-4B` |
| Architecture | `AutoModelForTokenClassification` (`num_labels=1`) |
| Training domain | Mathematics (competition-level problems) |
| Output | Per-token hallucination probability (sigmoid of logits) |
---
## Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "mr233/TokenHD-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=1)
model.eval()

problem = "What is the capital of France?"
response = "The capital of France is London."
messages = [
    {"role": "user", "content": problem},
    {"role": "assistant", "content": response},
]

# Tokenize the full conversation, then drop the trailing end-of-turn tokens
# that the chat template appends after the response
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False)[:-2]
input_tensor = torch.tensor(input_ids).unsqueeze(0)  # shape: (1, seq_len)

with torch.no_grad():
    logits = model(input_ids=input_tensor).logits  # shape: (1, seq_len, 1)

# scores for response tokens only
response_ids = tokenizer.encode(response, add_special_tokens=False)
scores = torch.sigmoid(logits.squeeze(-1).squeeze(0))[-len(response_ids):]
# scores[i] is the hallucination probability for the i-th response token
```
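
Once `scores` is available, a natural next step is to line the probabilities up with the response tokens and flag the suspicious ones. The sketch below is a minimal illustration and not part of the TokenHD code base: the helpers `flag_tokens` and `flagged_spans` and the `threshold` of 0.5 are hypothetical choices, and the token strings and scores at the bottom are made-up placeholders rather than real model output. With the snippet above, the inputs could instead come from `tokenizer.convert_ids_to_tokens(response_ids)` and `scores.tolist()`.

```python
from typing import List, Sequence, Tuple


def flag_tokens(
    tokens: Sequence[str], scores: Sequence[float], threshold: float = 0.5
) -> List[Tuple[int, str, float]]:
    """Return (position, token, score) for every token at or above the threshold."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    return [
        (i, tok, float(s))
        for i, (tok, s) in enumerate(zip(tokens, scores))
        if s >= threshold
    ]


def flagged_spans(
    scores: Sequence[float], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Merge consecutive flagged positions into half-open (start, end) spans."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold:
            if start is None:
                start = i  # a new flagged span begins here
        elif start is not None:
            spans.append((start, i))  # the span ended at the previous token
            start = None
    if start is not None:
        spans.append((start, len(scores)))  # span runs to the end of the response
    return spans


# Placeholder values for illustration only (assumed, not produced by the model)
example_tokens = ["The", " capital", " of", " France", " is", " London", "."]
example_scores = [0.02, 0.03, 0.01, 0.04, 0.10, 0.93, 0.61]

print(flag_tokens(example_tokens, example_scores))
print(flagged_spans(example_scores))
```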
---
## Evaluation
TokenHD models are evaluated with two metrics:
- **S_incor**: Token-level F1 on hallucinated (incorrect) responses — measures how precisely the detector localizes errors.
- **S_cor**: Recall on hallucination-free (correct) responses — measures how rarely the detector raises false alarms.
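
To make the two metrics concrete, here is a small sketch of how they could be computed from per-token scores and binary gold labels. It is an illustration under stated assumptions, not the paper's evaluation script: the 0.5 threshold, the function names `token_f1` and `clean_recall`, and the reading of S_cor as the fraction of tokens in a correct response that are left unflagged are all assumptions. The paper and code repository remain the reference for the exact definitions.

```python
from typing import Sequence


def token_f1(
    scores: Sequence[float], labels: Sequence[int], threshold: float = 0.5
) -> float:
    """Token-level F1 for the hallucinated class (label 1) on one response."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    preds = [int(s >= threshold) for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def clean_recall(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Assumed reading of S_cor: share of tokens in a hallucination-free
    response that stay below the threshold (i.e. are not false alarms)."""
    if len(scores) == 0:
        return 1.0  # nothing to flag in an empty response
    return sum(1 for s in scores if s < threshold) / len(scores)


# Toy inputs for illustration only (assumed values, not real results)
f1 = token_f1([0.1, 0.8, 0.9, 0.2], [0, 1, 0, 1])
rec = clean_recall([0.1, 0.2, 0.7, 0.3])
print(f1, rec)
```

In this toy case one of the two flagged tokens is truly hallucinated and one hallucinated token is missed, so precision and recall are both 0.5; on the clean response, one of four tokens is a false alarm.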