---
license: apache-2.0
base_model: Qwen/Qwen3-0.6B
tags:
  - hallucination-detection
  - token-classification
  - qwen3
language:
  - en
---

# TokenHD-0.6B

**TokenHD** is a token-level hallucination detector trained on top of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) using the TokenHD pipeline. It assigns a hallucination probability to each token in an LLM-generated response, enabling fine-grained localization of errors without requiring predefined step segmentation.

Paper: [arxiv.org/abs/2605.12384](https://arxiv.org/abs/2605.12384)  
Code: [github.com/rmin2000/TokenHD](https://github.com/rmin2000/TokenHD)  
Training Data: [mr233/TokenHD-training-data](https://huggingface.co/datasets/mr233/TokenHD-training-data)

---

## Model Details

| Property | Value |
|---|---|
| Base model | `Qwen/Qwen3-0.6B` |
| Architecture | `AutoModelForTokenClassification` (`num_labels=1`) |
| Training domain | Mathematics (competition-level problems) |
| Output | Per-token hallucination probability (sigmoid of logits) |

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "mr233/TokenHD-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=1)
model.eval()

problem = "What is the capital of France?"
response = "The capital of France is London."

messages = [
    {"role": "user", "content": problem},
    {"role": "assistant", "content": response},
]
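# Tokenize the full conversation. The [:-2] slice drops the last two tokens so the
# sequence ends on the final response token (this assumes the chat template closes
# the assistant turn with an end-of-turn marker followed by a newline).
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False)[:-2]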
input_tensor = torch.tensor(input_ids).unsqueeze(0)

with torch.no_grad():
    logits = model(input_ids=input_tensor).logits  # shape: (1, seq_len, 1)

# Keep scores for the response tokens only: the response is the tail of the sequence,
# so the last len(response_ids) positions line up with its tokens.
response_ids = tokenizer.encode(response, add_special_tokens=False)
scores = torch.sigmoid(logits.squeeze(-1).squeeze(0))[-len(response_ids):]
# scores[i] is the hallucination probability of the i-th response token
```
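
To make the per-token scores easier to read, they can be grouped into contiguous flagged spans. The following is a minimal sketch and not part of the TokenHD codebase: the helper name `flag_spans` is hypothetical, the 0.5 threshold is an assumption that mirrors the cutoff in the evaluation script, and the toy tokens and scores are made up for illustration rather than produced by the model.

```python
from typing import List, Tuple


def flag_spans(
    tokens: List[str], scores: List[float], threshold: float = 0.5
) -> List[Tuple[int, int, str]]:
    """Group consecutive tokens whose score exceeds `threshold` into spans.

    Returns (start_index, end_index_exclusive, text) triples.
    `flag_spans` is a hypothetical helper, not a TokenHD API.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")

    spans = []
    start = None  # index where the current flagged run began, if any
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i
        elif start is not None:
            # The flagged run ended at the previous token.
            spans.append((start, i, "".join(tokens[start:i])))
            start = None
    if start is not None:
        # The sequence ended while a run was still open.
        spans.append((start, len(tokens), "".join(tokens[start:])))
    return spans


# Toy values for illustration only; these are not real model outputs.
toy_tokens = ["The", " capital", " of", " France", " is", " London", "."]
toy_scores = [0.02, 0.03, 0.01, 0.05, 0.10, 0.93, 0.40]
print(flag_spans(toy_tokens, toy_scores))  # [(5, 6, ' London')]
```

With real outputs, one option is to pass `[tokenizer.decode([t]) for t in response_ids]` as `tokens` and `scores.tolist()` as `scores`.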

---

## Evaluation

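Use the [TokenHD eval dataset](https://huggingface.co/datasets/mr233/TokenHD-eval-data) to compute **S_incor** (token F1 on hallucinated samples) and **S_cor** (recall on hallucination-free samples). The script below binarizes both the ground-truth token weights and the predicted probabilities at 0.5 and reports each score as a percentage: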

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import numpy as np

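def hard_f1(y_true, y_pred):
    # Token-level precision, recall, and F1 over binary label arrays.
    # For a hallucination-free sample (no positive labels), invert both arrays
    # so that non-hallucinated tokens become the positive class.
    if y_true.max() == 0:
        y_true, y_pred = 1 - y_true, 1 - y_pred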
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp + 1e-7)
    recall    = tp / (tp + fn + 1e-7)
    f1        = 2 * precision * recall / (precision + recall + 1e-7)
    return precision, recall, f1

model_id = "mr233/TokenHD-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, num_labels=1, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

benchmarks = [
    "tokenhd_eval_math_500",
    "tokenhd_eval_math_aime",
    "tokenhd_eval_math_gpqa",
    "tokenhd_eval_math_fin_qa",
    "tokenhd_eval_math_olym",
    "tokenhd_eval_math_olym_phy",
]

for bench in benchmarks:
    dataset = load_dataset("mr233/TokenHD-eval-data",
                           data_files=f"{bench}.jsonl", split="train")
    f1_incor, f1_cor = [], []
    for item in dataset:
        token_weights_gt = np.array(item["token_weights"], dtype=np.float32)
        gt_hard = (token_weights_gt > 0.5).astype(np.float32)

        messages = [{"role": "user",      "content": item["problem"]},
                    {"role": "assistant", "content": item["raw_answer"]}]
        input_ids = tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=False)[:-2]
        input_tensor = torch.tensor(input_ids, device=model.device).unsqueeze(0)

        with torch.no_grad():
            logits = model(input_ids=input_tensor).logits
        scores = torch.sigmoid(logits.squeeze(-1).squeeze(0))[-len(gt_hard):]
        pred_hard = (scores.float().cpu().numpy() > 0.5).astype(np.float32)

        _, _, f1 = hard_f1(gt_hard, pred_hard)
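        # correctness == -1 is scored as a hallucinated sample; any other value
        # is scored as hallucination-free.
        if item["correctness"] == -1: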
            f1_incor.append(f1)
        else:
            f1_cor.append(f1)

    s_incor = np.mean(f1_incor) * 100 if f1_incor else float("nan")
    s_cor   = np.mean(f1_cor)   * 100 if f1_cor   else float("nan")
    print(f"{bench:<40s}  S_incor={s_incor:.2f}  S_cor={s_cor:.2f}")
```
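
To see how the label inversion in `hard_f1` behaves, the sketch below restates the same arithmetic in plain Python on two toy samples. The function name `token_prf` and the toy labels are illustrative assumptions, not part of the evaluation data. On a hallucination-free sample no false positives are possible after inversion, so precision is 1 whenever at least one token is left unflagged and the score is driven by recall over the non-hallucinated tokens.

```python
from typing import Sequence, Tuple

EPS = 1e-7  # same smoothing constant as in hard_f1


def token_prf(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Dependency-free restatement of hard_f1 (hypothetical name)."""
    # No hallucinated tokens in the ground truth: invert both label lists so
    # that "not hallucinated" becomes the positive class.
    if max(y_true) == 0:
        y_true = [1 - t for t in y_true]
        y_pred = [1 - p for p in y_pred]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp + EPS)
    recall = tp / (tp + fn + EPS)
    f1 = 2 * precision * recall / (precision + recall + EPS)
    return precision, recall, f1


# Toy hallucinated sample: tokens 3 and 4 are wrong, the detector flags 4 and 5.
p_bad, r_bad, f_bad = token_prf([0, 0, 0, 1, 1, 0], [0, 0, 0, 0, 1, 1])
print(round(p_bad, 2), round(r_bad, 2), round(f_bad, 2))  # 0.5 0.5 0.5

# Toy hallucination-free sample: the detector wrongly flags one of four tokens.
p_ok, r_ok, f_ok = token_prf([0, 0, 0, 0], [0, 1, 0, 0])
print(round(p_ok, 2), round(r_ok, 2), round(f_ok, 2))  # 1.0 0.75 0.86
```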