	---
	license: apache-2.0
	base_model: Qwen/Qwen3-4B
	tags:
	- hallucination-detection
	- token-classification
	- qwen3
	language:
	- en
	---

	# TokenHD-4B

	TokenHD-4B is a token-level hallucination detector trained on top of [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) using the TokenHD pipeline. It assigns a hallucination probability to each token of an LLM-generated response, which localizes errors at token granularity without requiring the response to be segmented into predefined steps.

	- Paper: [arxiv.org/abs/2605.12384](https://arxiv.org/abs/2605.12384)
	- Code: [github.com/rmin2000/TokenHD](https://github.com/rmin2000/TokenHD)
	- Training data: [mr233/TokenHD-training-data](https://huggingface.co/datasets/mr233/TokenHD-training-data)

	---

	## Model Details

	| Property | Value |
	|---|---|
	| Base model | `Qwen/Qwen3-4B` |
	| Architecture | `AutoModelForTokenClassification` (`num_labels=1`) |
	| Training domain | Mathematics (competition-level problems) |
	| Output | Per-token hallucination probability (sigmoid of logits) |

	---

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification
	import torch

	model_id = "mr233/TokenHD-4B"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=1)
	model.eval()

	problem = "What is the capital of France?"
	response = "The capital of France is London."

	# Format the problem/response pair with the model's chat template.
	messages = [
	    {"role": "user", "content": problem},
	    {"role": "assistant", "content": response},
	]
	# [:-2] drops the last two template tokens (assumed to be the end-of-turn
	# marker and the trailing newline), so the sequence ends on the response.
	input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False)[:-2]
	input_tensor = torch.tensor(input_ids).unsqueeze(0)  # add a batch dimension

	with torch.no_grad():
	    logits = model(input_ids=input_tensor).logits  # shape: (1, seq_len, 1)

	# Keep scores for the response tokens only (the tail of the sequence).
	response_ids = tokenizer.encode(response, add_special_tokens=False)
	scores = torch.sigmoid(logits.squeeze(-1).squeeze(0))[-len(response_ids):]
	# scores[i] is the hallucination probability for the i-th response token
	```
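
To turn the per-token scores into something readable, one option is to threshold them and merge adjacent flagged tokens into spans. The sketch below is a minimal illustration and not part of the TokenHD codebase: the helper `flag_spans`, the 0.5 threshold, and the sample scores are all assumptions made for this example.

```python
from typing import List, Tuple


def flag_spans(scores: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges of consecutive tokens
    whose score is at or above `threshold`.

    Hypothetical helper: the default threshold is an assumption, not a
    value recommended by the model card.
    """
    spans = []
    start = None  # index where the current flagged run began, if any
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:  # close a run that reaches the last token
        spans.append((start, len(scores)))
    return spans


# Made-up scores for a 7-token response, for illustration only.
example_scores = [0.02, 0.03, 0.01, 0.04, 0.05, 0.91, 0.40]
example_spans = flag_spans(example_scores)
print(example_spans)
```

With the usage snippet above, `scores.tolist()` would supply the input list, and `tokenizer.decode(response_ids[start:end])` would recover the text of each flagged span.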

	---

	## Evaluation

	TokenHD models are evaluated with two metrics:

	- S_incor: token-level F1 on hallucinated (incorrect) responses. It measures how precisely the detector localizes errors.
	- S_cor: recall on hallucination-free (correct) responses. It measures how rarely the detector raises false alarms.
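
As a rough illustration of the two metrics, the sketch below computes a token-level F1 for one incorrect response and a token-level recall for one correct response. It is one possible reading of the descriptions above, not the project's evaluation code. The function names, the 0.5 threshold, the binary label format (1 = hallucinated token), and the choice to treat S_cor as the fraction of tokens in a correct response that are left unflagged are all assumptions; the paper and repository define the exact protocol.

```python
from typing import List


def token_f1(scores: List[float], labels: List[int], threshold: float = 0.5) -> float:
    """Token-level F1 with 'hallucinated' (label 1) as the positive class.

    Assumed reading of S_incor for a single incorrect response.
    """
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def unflagged_fraction(scores: List[float], threshold: float = 0.5) -> float:
    """Fraction of tokens in a correct response scored below `threshold`.

    Assumed reading of S_cor: every token is truly non-hallucinated, so
    this is the recall of the 'not hallucinated' class.
    """
    if not scores:
        return 1.0  # an empty response raises no false alarms
    return sum(1 for s in scores if s < threshold) / len(scores)


# Made-up scores and labels, for illustration only.
f1 = token_f1([0.1, 0.8, 0.7, 0.2], [0, 1, 0, 0])
rec = unflagged_fraction([0.1, 0.2, 0.6, 0.3])
print(round(f1, 3), rec)
```

In the first example, two tokens are flagged but only one is labeled as hallucinated, so precision is 0.5 and recall is 1.0. In the second, one of four tokens in a correct response is flagged, which lowers the recall-style score.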