# Dianjin-PRM
Dianjin-PRM is a Process Reward Model (PRM) built on the Qwen3-8B architecture. It scores each reasoning step in a chain-of-thought trajectory, enabling Best-of-N selection and other process-supervision strategies for financial and mathematical reasoning tasks. This model is licensed under CC BY-NC-SA 4.0; the platform's license tag may differ where that option is not supported.
## Model Details
| Property | Value |
|---|---|
| Base Architecture | Qwen3-8B (Qwen3ForProcessRewardModel) |
| Parameters | ~8B |
| Precision | bfloat16 |
| Max Sequence Length | 40960 tokens |
| Output Labels | 2 (negative / positive per step) |
| Step Separator Token | `<extra_0>` |
## Requirements

```bash
pip install torch transformers
```
The model uses custom `trust_remote_code` classes (`v1_fin_prm.Qwen3ForProcessRewardModel` and `v1_fin_config.Qwen3PRMConfig`) that are loaded automatically via the `auto_map` entry in `config.json`.
## Quick Start

### 1. Load the Model
```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "path/to/Dianjin-PRM"

model = AutoModel.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    device_map=None,
).eval()

# Multi-GPU via DataParallel (optional)
model = torch.nn.DataParallel(model).cuda()

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
```
### 2. Prepare Input
The model expects input in the following format, with each reasoning step separated by `<extra_0>`:
```text
##Question
<your question here>

##Thinking Trajectory
<step 1><extra_0><step 2><extra_0>...<step N><extra_0>
```
Example:
```python
question = "What is the present value of $1000 received in 5 years at a 10% discount rate?"
steps = [
    "We need to calculate the present value using the formula PV = FV / (1 + r)^n.",
    "Substituting the values: PV = 1000 / (1 + 0.10)^5.",
    "PV = 1000 / 1.61051 ≈ 620.92.",
]
trajectory = "<extra_0>".join(steps) + "<extra_0>"
completion = f"##Question\n{question}\n\n##Thinking Trajectory\n{trajectory}"
```
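The formatting above can be wrapped in a small helper. This is a sketch of our own; the `build_completion` name is not part of the model's API:

```python
# Hypothetical helper (not shipped with the model): builds the expected
# "##Question / ##Thinking Trajectory" prompt from a question and a step list.
def build_completion(question: str, steps: list[str]) -> str:
    # Every step, including the last, is followed by the <extra_0> separator.
    trajectory = "".join(step + "<extra_0>" for step in steps)
    return f"##Question\n{question}\n\n##Thinking Trajectory\n{trajectory}"
```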
### 3. Compute Step Rewards
```python
def make_step_rewards(logits, token_masks):
    """Extract per-step reward scores from model logits."""
    probabilities = torch.nn.functional.softmax(logits, dim=-1)
    probabilities = probabilities * token_masks.unsqueeze(-1)
    all_scores_res = []
    for i in range(probabilities.size(0)):
        sample = probabilities[i]
        positive_probs = sample[sample != 0].view(-1, 2)[:, 1]
        all_scores_res.append(positive_probs.cpu().tolist())
    return all_scores_res

# Tokenize
input_ids = tokenizer(
    [completion],
    return_tensors="pt",
    padding=True,
    truncation=True,
)["input_ids"].to("cuda")

# Forward pass
with torch.inference_mode():
    outputs = model(input_ids=input_ids)

# Build step-separator mask and extract rewards
step_sep_id = tokenizer.encode("<extra_0>")[0]
token_masks = (input_ids == step_sep_id)
step_rewards = make_step_rewards(outputs.logits, token_masks)

print(step_rewards)
# e.g. [[0.92, 0.87, 0.95]] -> one score per step, per sample
```
Each score is the probability assigned to the positive label at the corresponding `<extra_0>` step boundary. Higher values indicate higher-quality reasoning steps.
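As a quick sanity check, the score-extraction logic can be exercised on toy tensors without loading the model. The logits below are synthetic (zeros, giving a uniform softmax); only `torch` is assumed:

```python
import torch

def make_step_rewards(logits, token_masks):
    """Extract per-step positive-class probabilities at masked positions."""
    probabilities = torch.nn.functional.softmax(logits, dim=-1)
    probabilities = probabilities * token_masks.unsqueeze(-1)
    all_scores_res = []
    for i in range(probabilities.size(0)):
        sample = probabilities[i]
        positive_probs = sample[sample != 0].view(-1, 2)[:, 1]
        all_scores_res.append(positive_probs.cpu().tolist())
    return all_scores_res

# One sample, four token positions, two classes; zero logits give a
# uniform softmax, so every masked position scores exactly 0.5.
logits = torch.zeros(1, 4, 2)
mask = torch.tensor([[False, True, False, True]])  # two separator positions
scores = make_step_rewards(logits, mask)
print(scores)  # [[0.5, 0.5]]: one score per separator
```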
### 4. Best-of-N Selection
To perform Best-of-N selection over multiple candidate responses:
```python
import numpy as np

candidates = [...]  # list of (trajectory_string, final_answer) tuples

all_rewards = []
for trajectory, answer in candidates:
    completion = f"##Question\n{question}\n\n##Thinking Trajectory\n{trajectory}"
    input_ids = tokenizer(
        [completion], return_tensors="pt", padding=True, truncation=True
    )["input_ids"].to("cuda")
    with torch.inference_mode():
        outputs = model(input_ids=input_ids)
    step_sep_id = tokenizer.encode("<extra_0>")[0]
    token_masks = (input_ids == step_sep_id)
    rewards = make_step_rewards(outputs.logits, token_masks)
    # Use the minimum step score as the overall trajectory score
    all_rewards.append(min(rewards[0]))

best_idx = int(np.argmax(all_rewards))
best_answer = candidates[best_idx][1]
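The minimum step score is one common way to collapse per-step rewards into a trajectory score; mean and product are reasonable alternatives. This comparison is our own sketch, not something prescribed by the model:

```python
import numpy as np

def aggregate(step_scores, how="min"):
    """Collapse per-step rewards into a single trajectory score."""
    s = np.asarray(step_scores, dtype=float)
    if how == "min":   # penalize the weakest step (strictest)
        return float(s.min())
    if how == "mean":  # average step quality
        return float(s.mean())
    if how == "prod":  # joint "all steps correct" probability
        return float(s.prod())
    raise ValueError(f"unknown aggregation: {how}")

scores = [0.92, 0.87, 0.95]
print(aggregate(scores, "min"))   # 0.87
print(aggregate(scores, "mean"))  # ~0.9133
print(aggregate(scores, "prod"))  # ~0.7604
```

"min" is strict (one bad step sinks the trajectory), while "mean" and "prod" trade that off against overall quality; which works best depends on the task.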
## Input Format Summary
| Component | Description |
|---|---|
| `##Question` | The original question/problem |
| `##Thinking Trajectory` | Reasoning steps separated by `<extra_0>` |
| `<extra_0>` | Special token used as step separator (token id: 151669) |
## Notes

- The model outputs 2 logits per token (negative, positive). The reward score for each step is the softmax probability of the positive class at each `<extra_0>` position.
- For batch inference, pass multiple completions as a list to the tokenizer with `padding=True`.
- Multi-GPU is supported via `torch.nn.DataParallel`.
- Always use `trust_remote_code=True` when loading the model, as it relies on custom architecture classes.