---
license: apache-2.0
base_model: Qwen/Qwen3-8B
tags:
- hallucination-detection
- token-classification
- qwen3
language:
- en
---

# TokenHD-8B (multi-domain)

**TokenHD** is a token-level hallucination detector trained on top of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) using the TokenHD pipeline. It assigns a hallucination probability to each token in an LLM-generated response, enabling fine-grained localization of errors without requiring predefined step segmentation.

- Paper: [arxiv.org/abs/2605.12384](https://arxiv.org/abs/2605.12384)
- Code: [github.com/rmin2000/TokenHD](https://github.com/rmin2000/TokenHD)
- Training Data: [mr233/TokenHD-training-data](https://huggingface.co/datasets/mr233/TokenHD-training-data)

---

## Model Details

| Property | Value |
|---|---|
| Base model | `Qwen/Qwen3-8B` |
| Architecture | `AutoModelForTokenClassification` (`num_labels=1`) |
| Training domain | Mathematics and code generation (multi-domain training) |
| Output | Per-token hallucination probability (sigmoid of logits) |
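
In other words, with `num_labels=1` the model returns one logit per token with shape `(batch, seq_len, 1)`, and a sigmoid maps each logit to a probability in [0, 1]. A minimal sketch with made-up logit values:

```python
import torch

# (batch=1, seq_len=3, num_labels=1) -> one logit per token
logits = torch.tensor([[[-2.0], [0.0], [3.0]]])
probs = torch.sigmoid(logits.squeeze(-1))  # approx [[0.12, 0.50, 0.95]]
```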

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "mr233/TokenHD-8B-Mix"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=1)
model.eval()

problem = "What is the capital of France?"
response = "The capital of France is London."

messages = [
    {"role": "user", "content": problem},
    {"role": "assistant", "content": response},
]
# drop the trailing end-of-turn tokens appended by the chat template so the
# sequence ends with the response tokens
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False)[:-2]
input_tensor = torch.tensor(input_ids).unsqueeze(0)

with torch.no_grad():
    logits = model(input_ids=input_tensor).logits  # shape: (1, seq_len, 1)

# scores for response tokens only
response_ids = tokenizer.encode(response, add_special_tokens=False)
scores = torch.sigmoid(logits.squeeze(-1).squeeze(0))[-len(response_ids):]
# scores[i] is the hallucination probability for the i-th response token
```
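
To see which parts of the response are flagged, you can map the scores back to tokens. A minimal sketch continuing from the snippet above (the 0.5 threshold and the max-pooling at the end are illustrative choices, not part of the released pipeline):

```python
# print each response token with its hallucination probability
threshold = 0.5
for tok_id, score in zip(response_ids, scores.tolist()):
    flag = "  <-- flagged" if score > threshold else ""
    print(f"{tokenizer.decode([tok_id])!r:>12} {score:.3f}{flag}")

# a simple response-level signal: the highest per-token probability
print("max token probability:", scores.max().item())
```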

---

## Evaluation

Use the [TokenHD eval dataset](https://huggingface.co/datasets/mr233/TokenHD-eval-data) to compute **S_incor** (token F1 on hallucinated samples) and **S_cor** (recall on hallucination-free samples):

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
import numpy as np

def hard_f1(y_true, y_pred):
    # for hallucination-free samples (no positive tokens), flip the labels so
    # the score reflects how well the detector leaves correct tokens unflagged
    if max(y_true) == 0:
        y_true, y_pred = 1 - y_true, 1 - y_pred
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp + 1e-7)
    recall = tp / (tp + fn + 1e-7)
    f1 = 2 * precision * recall / (precision + recall + 1e-7)
    return precision, recall, f1

model_id = "mr233/TokenHD-8B-Mix"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, num_labels=1, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

benchmarks = [
    "tokenhd_eval_math_500",
    "tokenhd_eval_math_aime",
    "tokenhd_eval_math_gpqa",
    "tokenhd_eval_math_fin_qa",
    "tokenhd_eval_math_olym",
    "tokenhd_eval_math_olym_phy",
    "tokenhd_eval_code_codeelo",
    "tokenhd_eval_code_live_code_lite",
]

for bench in benchmarks:
    dataset = load_dataset("mr233/TokenHD-eval-data",
                           data_files=f"{bench}.jsonl", split="train")
    f1_incor, f1_cor = [], []
    for item in dataset:
        token_weights_gt = np.array(item["token_weights"], dtype=np.float32)
        gt_hard = (token_weights_gt > 0.5).astype(np.float32)

        messages = [{"role": "user", "content": item["problem"]},
                    {"role": "assistant", "content": item["raw_answer"]}]
        input_ids = tokenizer.apply_chat_template(
            messages, tokenize=True, add_generation_prompt=False)[:-2]
        input_tensor = torch.tensor(input_ids, device=model.device).unsqueeze(0)

        with torch.no_grad():
            logits = model(input_ids=input_tensor).logits
        scores = torch.sigmoid(logits.squeeze(-1).squeeze(0))[-len(gt_hard):]
        pred_hard = (scores.float().cpu().numpy() > 0.5).astype(np.float32)

        _, _, f1 = hard_f1(gt_hard, pred_hard)
        if item["correctness"] == -1:
            f1_incor.append(f1)
        else:
            f1_cor.append(f1)

    s_incor = np.mean(f1_incor) * 100 if f1_incor else float("nan")
    s_cor = np.mean(f1_cor) * 100 if f1_cor else float("nan")
    print(f"{bench:<44s} S_incor={s_incor:.2f} S_cor={s_cor:.2f}")
```
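
To report a single macro-averaged number per setting, you can additionally collect the per-benchmark scores and average them afterwards. A minimal sketch, assuming you add a hypothetical `results[bench] = (s_incor, s_cor)` at the end of each iteration of the loop above:

```python
# hypothetical dict, filled inside the benchmark loop with
#   results[bench] = (s_incor, s_cor)
results = {}

if results:
    macro_incor = float(np.nanmean([v[0] for v in results.values()]))
    macro_cor = float(np.nanmean([v[1] for v in results.values()]))
    print(f"macro S_incor={macro_incor:.2f}  macro S_cor={macro_cor:.2f}")
```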