ThinkTank PRM: Process Reward Model for Reasoning Efficiency

A reward model that scores reasoning steps as useful or wasteful.

The model is trained on crowdsourced human judgments from ThinkTank, a Game With A Purpose (GWAP) in which players identify wasteful steps in AI reasoning chains.

Results

Metric              Value
Pairwise accuracy   95.7%
Eval loss           0.071
Training pairs      92
Eval pairs          23
Training time       105 seconds
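
For readers unfamiliar with the two metrics in the table above, the following is a minimal sketch of how pairwise accuracy and a pairwise loss are commonly computed for reward models. It assumes the standard Bradley-Terry objective, -log(sigmoid(chosen - rejected)); the card does not state which loss was actually used, and the function names and example scores are hypothetical.

```python
import math


def pairwise_loss(chosen_score, rejected_score):
    """Bradley-Terry loss for one pair: -log(sigmoid(chosen - rejected)).

    Assumption: this is the common reward-model objective, not necessarily
    the exact loss used to train this PRM.
    """
    margin = chosen_score - rejected_score
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def pairwise_accuracy(pairs):
    """Fraction of (chosen, rejected) score pairs where chosen ranks higher."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


# Hypothetical scores, for illustration only.
demo_pairs = [(3.21, -0.33), (1.08, -0.33), (-0.33, 3.25)]
demo_accuracy = pairwise_accuracy(demo_pairs)
demo_mean_loss = sum(pairwise_loss(c, r) for c, r in demo_pairs) / len(demo_pairs)
```

If the reported accuracy was measured on the 23 eval pairs, 95.7% would correspond to 22 of 23 pairs ranked correctly.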

Usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import PeftModel

# Load
tokenizer = AutoTokenizer.from_pretrained("vanthienha199/thinktank-prm-qwen2.5-0.5b")
base = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen2.5-0.5B", num_labels=1)
model = PeftModel.from_pretrained(base, "vanthienha199/thinktank-prm-qwen2.5-0.5b")
model.eval()

# Score a reasoning step
text = "Question: What is 25% of 200?\n\nReasoning step (step 3, calculation): 25% = 0.25. 0.25 * 200 = 50."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    score = model(**inputs).logits.item()

print(f"Score: {score:.3f}")  # Positive = useful, negative = wasteful

Example Scores

Step Type      Content                              Score   Label
thinking       "I need to find 25% of 200..."       -0.33   WASTEFUL
calculation    "25/100 = 0.25. 0.25 * 200 = 50"     +3.21   USEFUL
conclusion     "The answer is 50"                   +3.25   USEFUL
verification   "Let me double-check: 200/4 = 50"    +1.08   USEFUL
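
The labels above follow the sign convention from the Usage section (positive = useful, negative = wasteful). A minimal sketch of that mapping is shown here; the helper names, the adjustable threshold, and the treatment of a score of exactly 0 as wasteful are assumptions, since the card does not specify them.

```python
def label_step(score, threshold=0.0):
    """Map a raw PRM score to a label using the sign convention.

    Assumption: scores equal to the threshold are treated as WASTEFUL.
    """
    return "USEFUL" if score > threshold else "WASTEFUL"


def wasteful_fraction(scores, threshold=0.0):
    """Share of steps in a chain that fall at or below the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    wasteful = sum(1 for s in scores if label_step(s, threshold) == "WASTEFUL")
    return wasteful / len(scores)


# Scores taken from the example table.
example_scores = [-0.33, 3.21, 3.25, 1.08]
example_labels = [label_step(s) for s in example_scores]
example_waste = wasteful_fraction(example_scores)
```

Raising the threshold makes the labeling stricter, which may be useful if borderline steps such as the verification example should also be flagged.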

Training Details

  • Base model: Qwen/Qwen2.5-0.5B
  • Method: LoRA (r=16, alpha=32, dropout=0.1)
  • Target modules: q_proj, v_proj + score head
  • Epochs: 5
  • Learning rate: 1e-4
  • Hardware: Apple M4 (MPS), 105 seconds total
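
As a rough illustration of the LoRA settings listed above, the sketch below collects them in one place and derives the standard LoRA scaling factor (alpha / r). The class and method names are hypothetical. The keyword names in `to_peft_kwargs` match `peft.LoraConfig`, but saving the score head through `modules_to_save` and the `SEQ_CLS` task type are assumptions about how the adapter was configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraHyperparams:
    """LoRA settings as listed in Training Details."""
    r: int = 16
    alpha: int = 32
    dropout: float = 0.1
    target_modules: tuple = ("q_proj", "v_proj")
    # Assumption: the "score head" is trained fully and saved with the adapter.
    modules_to_save: tuple = ("score",)

    @property
    def scaling(self) -> float:
        """Standard LoRA scaling applied to the low-rank update: alpha / r."""
        return self.alpha / self.r

    def to_peft_kwargs(self) -> dict:
        """Keyword arguments that could be passed to peft.LoraConfig(**kwargs)."""
        return {
            "r": self.r,
            "lora_alpha": self.alpha,
            "lora_dropout": self.dropout,
            "target_modules": list(self.target_modules),
            "modules_to_save": list(self.modules_to_save),
            "task_type": "SEQ_CLS",  # assumption: sequence classification head
        }


params = LoraHyperparams()
peft_kwargs = params.to_peft_kwargs()
```

With r=16 and alpha=32 the low-rank update is scaled by 2.0 before being added to the frozen weights.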

The Pipeline

ThinkTank GWAP (19 users, 206 judgments)
    → Consensus labels (165 steps)
    → Reward pairs (115 chosen/rejected)
    → This PRM (95.7% accuracy)
    → Score any LLM reasoning chain
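
The first two arrows of the pipeline can be sketched as follows. This is an illustrative guess, not the project's actual code: it assumes consensus is a majority vote with ties and under-voted steps dropped, and that each useful step is paired with each wasteful step from the same reasoning chain. All names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product


def consensus_labels(judgments, min_votes=2):
    """Majority-vote label per step from (step_id, label) judgments.

    Assumption: steps with fewer than min_votes votes, or a tied vote,
    are dropped rather than labeled.
    """
    votes = defaultdict(Counter)
    for step_id, label in judgments:
        votes[step_id][label] += 1
    labels = {}
    for step_id, counts in votes.items():
        if sum(counts.values()) < min_votes:
            continue
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue  # tie: no consensus
        labels[step_id] = ranked[0][0]
    return labels


def build_pairs(labels, chain_of):
    """Pair useful (chosen) with wasteful (rejected) steps within each chain.

    Labels are expected to be "useful" or "wasteful".
    """
    by_chain = defaultdict(lambda: {"useful": [], "wasteful": []})
    for step_id, label in labels.items():
        by_chain[chain_of[step_id]][label].append(step_id)
    pairs = []
    for groups in by_chain.values():
        pairs.extend(product(groups["useful"], groups["wasteful"]))
    return pairs


# Hypothetical judgments: s3 is tied and s4 has a single vote, so both drop out.
demo_judgments = [
    ("s1", "wasteful"), ("s1", "wasteful"), ("s1", "useful"),
    ("s2", "useful"), ("s2", "useful"),
    ("s3", "useful"), ("s3", "wasteful"),
    ("s4", "useful"),
]
demo_chain_of = {"s1": "q1", "s2": "q1", "s3": "q1", "s4": "q2"}
demo_labels = consensus_labels(demo_judgments)
demo_reward_pairs = build_pairs(demo_labels, demo_chain_of)
```

Dropping ties and under-voted steps would be one way the step count shrinks between stages, as it does from 206 judgments to 165 labeled steps in the pipeline above.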

Citation

@misc{thinktank-prm-2026,
  title={ThinkTank PRM: A Process Reward Model Trained on Crowdsourced Reasoning Labels},
  author={Ha Le},
  year={2026},
  url={https://huggingface.co/vanthienha199/thinktank-prm-qwen2.5-0.5b}
}