---
license: apache-2.0
base_model: Qwen/Qwen3-8B
tags:
  - hallucination-detection
  - token-classification
  - qwen3
language:
  - en
---

# TokenHD-8B (multi-domain)

**TokenHD** is a token-level hallucination detector trained on top of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) using the TokenHD pipeline. It assigns a hallucination probability to each token in an LLM-generated response, enabling fine-grained localization of errors without requiring predefined step segmentation.

Paper: [arxiv.org/abs/2605.12384](https://arxiv.org/abs/2605.12384)  
Code: [github.com/rmin2000/TokenHD](https://github.com/rmin2000/TokenHD)  
Training Data: [mr233/TokenHD-training-data](https://huggingface.co/datasets/mr233/TokenHD-training-data)

---

## Model Details

| Property | Value |
|---|---|
| Base model | `Qwen/Qwen3-8B` |
| Architecture | `AutoModelForTokenClassification` (`num_labels=1`) |
| Training domain | Mathematics and code generation (multi-domain training) |
| Output | Per-token hallucination probability (sigmoid of logits) |

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "mr233/TokenHD-8B-Mix"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=1)
model.eval()

problem = "What is the capital of France?"
response = "The capital of France is London."

messages = [
    {"role": "user", "content": problem},
    {"role": "assistant", "content": response},
]
# [:-2] trims the trailing end-of-turn tokens so the sequence ends on the last response token
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False)[:-2]
input_tensor = torch.tensor(input_ids).unsqueeze(0)

with torch.no_grad():
    logits = model(input_ids=input_tensor).logits  # shape: (1, seq_len, 1)

# scores for response tokens only
response_ids = tokenizer.encode(response, add_special_tokens=False)
scores = torch.sigmoid(logits.squeeze(-1).squeeze(0))[-len(response_ids):]
# scores[i] is the hallucination probability for the i-th response token
```
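To inspect which tokens the detector flags, the scores can be paired with the decoded token strings. A minimal post-processing sketch (not part of the TokenHD API; `flag_tokens` is a hypothetical helper, and the token/score values below are illustrative — in practice use `tokenizer.convert_ids_to_tokens(response_ids)` and the `scores` tensor from above):

```python
def flag_tokens(tokens, scores, threshold=0.5):
    """Return (token, score) pairs whose score exceeds the threshold."""
    return [(t, s) for t, s in zip(tokens, scores) if s > threshold]

# Illustrative values for the "capital of France" example above.
tokens = ["The", " capital", " of", " France", " is", " London", "."]
scores = [0.02, 0.03, 0.01, 0.05, 0.10, 0.97, 0.40]
flagged = flag_tokens(tokens, scores)  # [(" London", 0.97)]
```

Lowering the threshold trades precision for recall; 0.5 matches the hard-label cutoff used in the evaluation script below.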

---

## Evaluation

Use the [TokenHD eval dataset](https://huggingface.co/datasets/mr233/TokenHD-eval-data) to compute **S_incor** (token-level F1 on hallucinated samples) and **S_cor** (token-level F1 on hallucination-free samples, computed on inverted labels so it rewards correctly leaving clean tokens unflagged):

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import numpy as np

def hard_f1(y_true, y_pred):
    if max(y_true) == 0:
        y_true, y_pred = 1 - y_true, 1 - y_pred
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp + 1e-7)
    recall    = tp / (tp + fn + 1e-7)
    f1        = 2 * precision * recall / (precision + recall + 1e-7)
    return precision, recall, f1

model_id = "mr233/TokenHD-8B-Mix"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, num_labels=1, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

benchmarks = [
    "tokenhd_eval_math_500",
    "tokenhd_eval_math_aime",
    "tokenhd_eval_math_gpqa",
    "tokenhd_eval_math_fin_qa",
    "tokenhd_eval_math_olym",
    "tokenhd_eval_math_olym_phy",
    "tokenhd_eval_code_codeelo",
    "tokenhd_eval_code_live_code_lite",
]

for bench in benchmarks:
    dataset = load_dataset("mr233/TokenHD-eval-data",
                           data_files=f"{bench}.jsonl", split="train")
    f1_incor, f1_cor = [], []
    for item in dataset:
        token_weights_gt = np.array(item["token_weights"], dtype=np.float32)
        gt_hard = (token_weights_gt > 0.5).astype(np.float32)

        messages = [{"role": "user",      "content": item["problem"]},
                    {"role": "assistant", "content": item["raw_answer"]}]
        input_ids = tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=False)[:-2]
        input_tensor = torch.tensor(input_ids, device=model.device).unsqueeze(0)

        with torch.no_grad():
            logits = model(input_ids=input_tensor).logits
        scores = torch.sigmoid(logits.squeeze(-1).squeeze(0))[-len(gt_hard):]
        pred_hard = (scores.float().cpu().numpy() > 0.5).astype(np.float32)

        _, _, f1 = hard_f1(gt_hard, pred_hard)
        if item["correctness"] == -1:
            f1_incor.append(f1)
        else:
            f1_cor.append(f1)

    s_incor = np.mean(f1_incor) * 100 if f1_incor else float("nan")
    s_cor   = np.mean(f1_cor)   * 100 if f1_cor   else float("nan")
    print(f"{bench:<44s}  S_incor={s_incor:.2f}  S_cor={s_cor:.2f}")
```
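To make the label-inversion step in `hard_f1` concrete, here is a self-contained toy example (repeating the metric definition from the script above; the ground-truth and prediction arrays are illustrative). For a sample with no hallucinated tokens, both arrays are inverted so the score measures agreement on clean tokens rather than degenerating to zero:

```python
import numpy as np

def hard_f1(y_true, y_pred):
    # Hallucination-free sample: no positive labels, so invert both
    # arrays and score agreement on clean tokens instead.
    if max(y_true) == 0:
        y_true, y_pred = 1 - y_true, 1 - y_pred
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp + 1e-7)
    recall = tp / (tp + fn + 1e-7)
    f1 = 2 * precision * recall / (precision + recall + 1e-7)
    return precision, recall, f1

# Hallucinated sample: tokens 1 and 2 are labelled hallucinated,
# but the detector only catches token 1.
gt = np.array([0, 1, 1, 0], dtype=np.float32)
pred = np.array([0, 1, 0, 0], dtype=np.float32)
p, r, f1 = hard_f1(gt, pred)  # precision ≈ 1.0, recall ≈ 0.5

# Hallucination-free sample: all labels zero. After inversion, a fully
# clean prediction scores F1 ≈ 1.0 instead of 0.
gt_clean = np.zeros(4, dtype=np.float32)
pred_clean = np.zeros(4, dtype=np.float32)
_, _, f1_clean = hard_f1(gt_clean, pred_clean)
```

The `1e-7` terms keep the divisions finite when a sample has no positives at all, at the cost of scores being infinitesimally below their exact values.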