---
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- hallucination-detection
- token-classification
- qwen3
language:
- en
---

# TokenHD-4B

**TokenHD** is a token-level hallucination detector trained on top of [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) using the TokenHD pipeline. It assigns a hallucination probability to each token in an LLM-generated response, enabling fine-grained localization of errors without requiring predefined step segmentation.

- Paper: [arxiv.org/abs/2605.12384](https://arxiv.org/abs/2605.12384)
- Code: [github.com/rmin2000/TokenHD](https://github.com/rmin2000/TokenHD)
- Training Data: [mr233/TokenHD-training-data](https://huggingface.co/datasets/mr233/TokenHD-training-data)

---

## Model Details

| Property | Value |
|---|---|
| Base model | `Qwen/Qwen3-4B` |
| Architecture | `AutoModelForTokenClassification` (`num_labels=1`) |
| Training domain | Mathematics (competition-level problems) |
| Output | Per-token hallucination probability (sigmoid of logits) |

---

## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_id = "mr233/TokenHD-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=1)
model.eval()

problem = "What is the capital of France?"
response = "The capital of France is London."

messages = [
    {"role": "user", "content": problem},
    {"role": "assistant", "content": response},
]
# Tokenize the full conversation; [:-2] drops the last two tokens (presumably the
# closing tokens the chat template appends), so the sequence ends on the response.
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=False)[:-2]
input_tensor = torch.tensor(input_ids).unsqueeze(0)  # add a batch dimension

with torch.no_grad():
    logits = model(input_ids=input_tensor).logits  # shape: (1, seq_len, 1)

# Keep scores for the response tokens only (the last len(response_ids) positions)
response_ids = tokenizer.encode(response, add_special_tokens=False)
scores = torch.sigmoid(logits.squeeze(-1).squeeze(0))[-len(response_ids):]
# scores[i] is the hallucination probability for the i-th response token
```
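
To turn per-token probabilities into readable error spans, one option is to threshold the scores and merge consecutive flagged tokens. The sketch below is illustrative only: `flag_spans` is a hypothetical helper, the 0.5 threshold is an assumption rather than a value taken from the model card, and the toy scores are made up. With real output you would pass `scores.tolist()` and one decoded string per entry of `response_ids`.

```python
from typing import List, Tuple


def flag_spans(
    tokens: List[str], scores: List[float], threshold: float = 0.5
) -> List[Tuple[int, int, str]]:
    """Merge consecutive tokens with score >= threshold into (start, end, text) spans.

    `end` is exclusive. The threshold is an assumed default, not a tuned value.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")

    spans: List[Tuple[int, int, str]] = []
    start = None  # index where the current flagged run began, if any
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The flagged run ended at the previous token
            spans.append((start, i, "".join(tokens[start:i])))
            start = None
    if start is not None:
        # The flagged run extends to the end of the response
        spans.append((start, len(tokens), "".join(tokens[start:])))
    return spans


# Toy values for illustration only; these are not real model outputs.
toy_tokens = ["The", " capital", " of", " France", " is", " London", "."]
toy_scores = [0.02, 0.03, 0.01, 0.05, 0.10, 0.93, 0.40]
print(flag_spans(toy_tokens, toy_scores))
```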

---

## Evaluation

TokenHD models are evaluated with two metrics:

- **S_incor**: Token-level F1 on hallucinated (incorrect) responses. It measures how precisely the detector localizes errors.
- **S_cor**: Recall on hallucination-free (correct) responses. It measures how rarely the detector raises false alarms.
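
The model card does not spell out the formulas, so the following is only a rough sketch of how the two metrics could be computed for a single response. It assumes binary gold labels per token (1 = hallucinated), a 0.5 decision threshold, and reads S_cor as the fraction of tokens in a correct response that are left unflagged. The helper names `binarize`, `token_f1`, and `clean_token_recall` are hypothetical; the paper and code repository remain the reference for the exact definitions and for how scores are averaged across responses.

```python
from typing import List, Sequence


def binarize(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn probabilities into 0/1 predictions (the threshold is an assumption)."""
    return [int(s >= threshold) for s in scores]


def token_f1(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Token-level F1 for the hallucinated class on one incorrect response."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same length")
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
    if tp == 0:
        return 0.0  # no true positives, so precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def clean_token_recall(pred: Sequence[int]) -> float:
    """Share of tokens in a correct response predicted as non-hallucinated.

    This is one assumed reading of S_cor; every gold label is 0 here.
    """
    if not pred:
        return 1.0
    return sum(1 for p in pred if p == 0) / len(pred)


# Toy labels and scores for illustration only.
gold = [0, 0, 0, 0, 0, 1, 1]
pred = binarize([0.02, 0.03, 0.01, 0.05, 0.60, 0.93, 0.40])
print(token_f1(pred, gold))                               # one TP, one FP, one FN
print(clean_token_recall(binarize([0.1, 0.2, 0.7, 0.3])))  # one false alarm in four tokens
```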