# Uploaded fine-tuned model
- Developed by: ikedachin
- Model Type: Causal Language Model (RLVR - Reinforcement Learning from Verifiable Rewards)
- Base Model: Qwen/Qwen2.5-3B-Instruct
- Language(s): Multilingual (primarily tested on English and Japanese reasoning tasks)
- License: Apache 2.0
- Training Algorithm: GRPO (via TRL and Unsloth)
- Purpose: Environment Validation / Proof of Concept
- This model was primarily developed to verify the successful setup and integration of the ThinkStation PGX (GB10 Blackwell) local environment with the Unsloth/GRPO training pipeline.

This Qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
# Model Card for Qwen2.5-3B-GRPO-ThinkStation
This model is a fine-tuned version of Qwen/Qwen2.5-3B-Instruct using DeepSeek's GRPO (Group Relative Policy Optimization) algorithm.
The training was conducted on a Lenovo ThinkStation PGX workstation, moving away from cloud-based environments like Google Colab to leverage local high-performance hardware.
## Training Hardware & Environment
Unlike typical notebook-based training, this model was trained on a local professional workstation:
- Workstation: Lenovo ThinkStation PGX
- Optimization Library: Unsloth (for 2x faster training and 70% less memory usage)
- Framework: PyTorch, TRL, PEFT
### Training Infrastructure (ThinkStation PGX)
The use of the ThinkStation PGX allows for:
- Stable, long-term training sessions without session timeouts.
- High-speed NVMe storage for fast checkpointing.
- Efficient thermal management for sustained GPU performance during RL epochs.
## Training Procedure
The model was trained using the GRPO algorithm, which optimizes the policy by comparing a group of outputs against each other based on reward functions (e.g., accuracy for math/logic or format consistency) without requiring a separate value function (critic model).
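As a minimal illustration of the group-relative idea described above (a sketch, not the actual training code), each of the G sampled completions for a prompt can be assigned an advantage by normalizing its reward against the group's mean and standard deviation, which is why no separate critic model is needed:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each completion's reward
    against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: G = 8 completions scored by a verifiable reward
# (1.0 = correct answer, 0.0 = wrong answer)
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advs = group_relative_advantages(rewards)
print([round(a, 3) for a in advs])
```

Correct completions receive a positive advantage and incorrect ones a negative advantage, purely from within-group comparison.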
### Training Hyperparameters
- Learning Rate: 5e-6
- Batch Size: 1
- Gradient Accumulation Steps: 1
- Num Generations (G): 8
- Max Sequence Length: 2048
- Optimizer: AdamW_8bit
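Assuming TRL's `GRPOConfig` API (the exact argument names below come from TRL and may differ across versions; this is a sketch of how the hyperparameters above would map, not the training script used here):

```python
from trl import GRPOConfig

# Sketch: mirrors the hyperparameters listed above.
config = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    num_generations=8,            # G: completions sampled per prompt
    max_completion_length=2048,
    optim="adamw_8bit",           # 8-bit AdamW (requires bitsandbytes)
)
```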
## How to use

You can use this model with Unsloth or with the standard Transformers library:
```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="ikedachin/qwen_finetune_16bit_unsloth_gb10",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable fast inference mode

# Test prompt (Qwen ChatML format)
inputs = tokenizer([
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nCalculate 2+2 and explain why.<|im_end|>\n"
    "<|im_start|>assistant\n<|thought|>\n"
], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```
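If you prefer plain Transformers instead of Unsloth, a rough equivalent looks like this (a sketch assuming the checkpoint ships a standard Qwen2 chat template; it requires a CUDA-capable GPU and downloads the model weights):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ikedachin/qwen_finetune_16bit_unsloth_gb10"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Calculate 2+2 and explain why."},
]
# Let the tokenizer build the ChatML prompt instead of writing it by hand
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```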
## Acknowledgements
- This model was trained using the Unsloth library.
- Training was performed on a ThinkStation PGX workstation.
- Built upon the Qwen 2.5 architecture.
## Purpose of this Project
This model is a byproduct of environment verification. I aimed to confirm that the ThinkStation PGX can handle the heavy computation required for GRPO (Group Relative Policy Optimization).
For the detailed verification process and hardware setup, please see my Qiita article (Japanese only).
