---
license: apache-2.0
base_model: Qwen/Qwen3-8B
tags:
- hallucination-detection
- token-classification
- qwen3
language:
- en
---

# TokenHD-8B (multi-domain)

**TokenHD** is a token-level hallucination detector trained on top of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) using the TokenHD pipeline. It assigns a hallucination probability to each token in an LLM-generated response, enabling fine-grained localization of errors without requiring predefined step segmentation.

- Paper: [arxiv.org/abs/2605.12384](https://arxiv.org/abs/2605.12384)
- Code: [github.com/rmin2000/TokenHD](https://github.com/rmin2000/TokenHD)
- Training Data: [mr233/TokenHD-training-data](https://huggingface.co/datasets/mr233/TokenHD-training-data)

---

## Model Details

| Property | Value |
|---|---|
| Base model | `Qwen/Qwen3-8B` |
| Architecture | `AutoModelForTokenClassification` (`num_labels=1`) |
| Training domain | Mathematics and code generation (multi-domain training) |
| Output | Per-token hallucination probability (sigmoid of logits) |
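
In other words, with `num_labels=1` the model returns one logit per token with shape `(batch, seq_len, 1)`, and a sigmoid maps each logit to a probability in [0, 1]. A minimal sketch with made-up logit values:

```python
import torch

# (batch=1, seq_len=3, num_labels=1) -> one logit per token
logits = torch.tensor([[[-2.0], [0.0], [3.0]]])
probs = torch.sigmoid(logits.squeeze(-1))  # approx [[0.12, 0.50, 0.95]]
```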

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "mr233/TokenHD-8B-Mix"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=1)
model.eval()

problem = "What is the capital of France?"
response = "The capital of France is London."

messages = [
    {"role": "user", "content": problem},
    {"role": "assistant", "content": response},
]
# drop the trailing end-of-turn tokens appended by the chat template so the
# sequence ends with the response tokens
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False)[:-2]
input_tensor = torch.tensor(input_ids).unsqueeze(0)

with torch.no_grad():
    logits = model(input_ids=input_tensor).logits  # shape: (1, seq_len, 1)

# scores for response tokens only
response_ids = tokenizer.encode(response, add_special_tokens=False)
scores = torch.sigmoid(logits.squeeze(-1).squeeze(0))[-len(response_ids):]
# scores[i] is the hallucination probability for the i-th response token
```
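
To see which parts of the response are flagged, you can map the scores back to tokens. A minimal sketch continuing from the snippet above (the 0.5 threshold and the max-pooling at the end are illustrative choices, not part of the released pipeline):

```python
# print each response token with its hallucination probability
threshold = 0.5
for tok_id, score in zip(response_ids, scores.tolist()):
    flag = "  <-- flagged" if score > threshold else ""
    print(f"{tokenizer.decode([tok_id])!r:>12} {score:.3f}{flag}")

# a simple response-level signal: the highest per-token probability
print("max token probability:", scores.max().item())
```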

---

## Evaluation

Use the [TokenHD eval dataset](https://huggingface.co/datasets/mr233/TokenHD-eval-data) to compute **S_incor** (token F1 on hallucinated samples) and **S_cor** (recall on hallucination-free samples):

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import numpy as np

def hard_f1(y_true, y_pred):
    # for hallucination-free samples (no positive tokens), flip the labels so
    # the score reflects how well the detector leaves correct tokens unflagged
    if max(y_true) == 0:
        y_true, y_pred = 1 - y_true, 1 - y_pred
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp + 1e-7)
    recall = tp / (tp + fn + 1e-7)
    f1 = 2 * precision * recall / (precision + recall + 1e-7)
    return precision, recall, f1

model_id = "mr233/TokenHD-8B-Mix"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, num_labels=1, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

benchmarks = [
    "tokenhd_eval_math_500",
    "tokenhd_eval_math_aime",
    "tokenhd_eval_math_gpqa",
    "tokenhd_eval_math_fin_qa",
    "tokenhd_eval_math_olym",
    "tokenhd_eval_math_olym_phy",
    "tokenhd_eval_code_codeelo",
    "tokenhd_eval_code_live_code_lite",
]

for bench in benchmarks:
    dataset = load_dataset("mr233/TokenHD-eval-data",
                           data_files=f"{bench}.jsonl", split="train")
    f1_incor, f1_cor = [], []
    for item in dataset:
        token_weights_gt = np.array(item["token_weights"], dtype=np.float32)
        gt_hard = (token_weights_gt > 0.5).astype(np.float32)

        messages = [{"role": "user", "content": item["problem"]},
                    {"role": "assistant", "content": item["raw_answer"]}]
        input_ids = tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=False)[:-2]
        input_tensor = torch.tensor(input_ids, device=model.device).unsqueeze(0)

        with torch.no_grad():
            logits = model(input_ids=input_tensor).logits
        scores = torch.sigmoid(logits.squeeze(-1).squeeze(0))[-len(gt_hard):]
        pred_hard = (scores.float().cpu().numpy() > 0.5).astype(np.float32)

        _, _, f1 = hard_f1(gt_hard, pred_hard)
        if item["correctness"] == -1:
            f1_incor.append(f1)
        else:
            f1_cor.append(f1)

    s_incor = np.mean(f1_incor) * 100 if f1_incor else float("nan")
    s_cor = np.mean(f1_cor) * 100 if f1_cor else float("nan")
    print(f"{bench:<44s} S_incor={s_incor:.2f} S_cor={s_cor:.2f}")
```
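
To report a single macro-averaged number per setting, you can additionally collect the per-benchmark scores and average them afterwards. A minimal sketch, assuming you add a hypothetical `results[bench] = (s_incor, s_cor)` at the end of each iteration of the loop above:

```python
# hypothetical dict, filled inside the benchmark loop with
#   results[bench] = (s_incor, s_cor)
results = {}

if results:
    macro_incor = float(np.nanmean([v[0] for v in results.values()]))
    macro_cor = float(np.nanmean([v[1] for v in results.values()]))
    print(f"macro S_incor={macro_incor:.2f}  macro S_cor={macro_cor:.2f}")
```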