# ThinkTank PRM: Process Reward Model for Reasoning Efficiency
A reward model that scores reasoning steps as useful or wasteful.
Trained on crowdsourced human judgments from ThinkTank, a Game With A Purpose where players identify wasteful steps in AI reasoning chains.
## Results
| Metric | Value |
|---|---|
| Pairwise accuracy | 95.7% |
| Eval loss | 0.071 |
| Training pairs | 92 |
| Eval pairs | 23 |
| Training time | 105 seconds |
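Pairwise accuracy here is the fraction of chosen/rejected pairs on which the model scores the chosen (useful) step above the rejected (wasteful) one. A minimal sketch of the metric, with a toy scorer standing in for the PRM (the function names are illustrative, not part of this repo):

```python
def pairwise_accuracy(pairs, score):
    """Fraction of (chosen, rejected) pairs where the chosen step outscores the rejected one."""
    correct = sum(1 for chosen, rejected in pairs if score(chosen) > score(rejected))
    return correct / len(pairs)

# Toy scorer for illustration only: longer steps score higher.
toy_score = len
pairs = [("25/100 = 0.25. 0.25 * 200 = 50", "hmm"), ("The answer is 50", "ok")]
print(pairwise_accuracy(pairs, toy_score))  # 1.0
```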
## Usage
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

# Load the base model and apply the LoRA adapter
tokenizer = AutoTokenizer.from_pretrained("vanthienha199/thinktank-prm-qwen2.5-0.5b")
base = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen2.5-0.5B", num_labels=1)
model = PeftModel.from_pretrained(base, "vanthienha199/thinktank-prm-qwen2.5-0.5b")
model.eval()

# Score a reasoning step
text = "Question: What is 25% of 200?\n\nReasoning step (step 3, calculation): 25% = 0.25. 0.25 * 200 = 50."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    score = model(**inputs).logits.item()
print(f"Score: {score:.3f}")  # Positive = useful, negative = wasteful
```
## Example Scores
| Step Type | Content | Score | Label |
|---|---|---|---|
| thinking | "I need to find 25% of 200..." | -0.33 | WASTEFUL |
| calculation | "25/100 = 0.25. 0.25 * 200 = 50" | +3.21 | USEFUL |
| conclusion | "The answer is 50" | +3.25 | USEFUL |
| verification | "Let me double-check: 200/4 = 50" | +1.08 | USEFUL |
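Since the sign of the score separates useful from wasteful steps, one natural application is pruning a reasoning chain. A minimal sketch, assuming a threshold of 0.0 (the helper name is illustrative, not part of this repo):

```python
def prune_wasteful(steps, scores, threshold=0.0):
    """Keep only the steps whose PRM score clears the threshold."""
    return [step for step, score in zip(steps, scores) if score > threshold]

# Scores taken from the example table above
steps = [
    "I need to find 25% of 200...",
    "25/100 = 0.25. 0.25 * 200 = 50",
    "The answer is 50",
]
scores = [-0.33, 3.21, 3.25]
print(prune_wasteful(steps, scores))
# ['25/100 = 0.25. 0.25 * 200 = 50', 'The answer is 50']
```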
## Training Details
- Base model: Qwen/Qwen2.5-0.5B
- Method: LoRA (r=16, alpha=32, dropout=0.1)
- Target modules: q_proj, v_proj + score head
- Epochs: 5
- Learning rate: 1e-4
- Hardware: Apple M4 (MPS), 105 seconds total
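The card does not state the exact training objective, but reward models trained on chosen/rejected pairs commonly use a Bradley-Terry style loss, `-log(sigmoid(chosen - rejected))`. A sketch of that loss under this assumption:

```python
import math

def pairwise_loss(chosen_score, rejected_score):
    """Bradley-Terry style pairwise loss: -log(sigmoid(chosen - rejected))."""
    margin = chosen_score - rejected_score
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ordered pair incurs a small loss; an inverted pair a large one
print(pairwise_loss(3.21, -0.33))   # small
print(pairwise_loss(-0.33, 3.21))   # large
```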
## The Pipeline

ThinkTank GWAP (19 users, 206 judgments)
  → Consensus labels (165 steps)
  → Reward pairs (115 chosen/rejected)
  → This PRM (95.7% accuracy)
  → Score any LLM reasoning chain
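The "consensus labels → reward pairs" step can be sketched as pairing useful and wasteful steps drawn from the same question. This is an assumption about the pairing scheme, and the field names below are illustrative, not the dataset's actual schema:

```python
def build_pairs(labeled_steps):
    """Pair each useful step with each wasteful step from the same question."""
    by_question = {}
    for step in labeled_steps:
        by_question.setdefault(step["question_id"], []).append(step)

    pairs = []
    for steps in by_question.values():
        useful = [s for s in steps if s["label"] == "useful"]
        wasteful = [s for s in steps if s["label"] == "wasteful"]
        pairs.extend((u["text"], w["text"]) for u in useful for w in wasteful)
    return pairs

labels = [
    {"question_id": 1, "text": "0.25 * 200 = 50", "label": "useful"},
    {"question_id": 1, "text": "Hmm, let me think...", "label": "wasteful"},
]
print(build_pairs(labels))  # [('0.25 * 200 = 50', 'Hmm, let me think...')]
```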
## Links

- ThinkTank Game: play and contribute labels
- Step Labels Dataset
- Reward Pairs Dataset
## Citation

```bibtex
@misc{thinktank-prm-2026,
  title={ThinkTank PRM: A Process Reward Model Trained on Crowdsourced Reasoning Labels},
  author={Ha Le},
  year={2026},
  url={https://huggingface.co/vanthienha199/thinktank-prm-qwen2.5-0.5b}
}
```