# Uploaded fine-tuned model
- Developed by: ikedachin
- Model Type: Causal Language Model (RLVR - Reinforcement Learning from Verifiable Rewards)
- Base Model: Qwen/Qwen2.5-3B-Instruct
- Language(s): Multilingual (primarily tested on English and Japanese reasoning tasks)
- License: Apache 2.0
- Training Algorithm: GRPO (via TRL and Unsloth)
- Purpose: Environment Validation / Proof of Concept
- This model was primarily developed to verify the successful setup and integration of the ThinkStation PGX (GB10 Blackwell) local environment with the Unsloth/GRPO training pipeline.

This Qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
# Model Card for Qwen2.5-3B-GRPO-ThinkStation
This model is a fine-tuned version of Qwen/Qwen2.5-3B-Instruct using DeepSeek's GRPO (Group Relative Policy Optimization) algorithm.
The training was conducted on a Lenovo ThinkStation PGX workstation, moving away from cloud-based environments like Google Colab to leverage local high-performance hardware.
## Training Hardware & Environment
Unlike typical notebook-based training, this model was trained on a local professional workstation:
- Workstation: Lenovo ThinkStation PGX
- Optimization Library: Unsloth (for 2x faster training and 70% less memory usage)
- Framework: PyTorch, TRL, PEFT
### Training Infrastructure (ThinkStation PGX)
The use of the ThinkStation PGX allows for:
- Stable, long-term training sessions without session timeouts.
- High-speed NVMe storage for fast checkpointing.
- Efficient thermal management for sustained GPU performance during RL epochs.
## Training Procedure
The model was trained using the GRPO algorithm, which optimizes the policy by comparing a group of outputs against each other based on reward functions (e.g., accuracy for math/logic or format consistency) without requiring a separate value function (critic model).
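As a minimal illustration of the group-relative idea described above (a sketch, not the actual training code), each of the G sampled completions for a prompt can be assigned an advantage by normalizing its reward against the group's mean and standard deviation, which is why no separate critic model is needed:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each completion's reward
    against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: G = 8 completions scored by a verifiable reward
# (1.0 = correct answer, 0.0 = wrong answer)
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advs = group_relative_advantages(rewards)
print([round(a, 3) for a in advs])
```

Correct completions receive a positive advantage and incorrect ones a negative advantage, purely from within-group comparison.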
### Training Hyperparameters
- Learning Rate: 5e-6
- Batch Size: 1
- Gradient Accumulation Steps: 1
- Num Generations (G): 8
- Max Sequence Length: 2048
- Optimizer: AdamW_8bit
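Assuming TRL's `GRPOConfig` API (the exact argument names below come from TRL and may differ across versions; this is a sketch of how the hyperparameters above would map, not the training script used here):

```python
from trl import GRPOConfig

# Sketch: mirrors the hyperparameters listed above.
config = GRPOConfig(
    learning_rate=5e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    num_generations=8,            # G: completions sampled per prompt
    max_completion_length=2048,
    optim="adamw_8bit",           # 8-bit AdamW (requires bitsandbytes)
)
```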
## How to use

You can use this model with Unsloth or with the standard Transformers library:
```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="ikedachin/qwen_finetune_16bit_unsloth_gb10",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable fast inference mode

# Test prompt (Qwen ChatML format)
inputs = tokenizer([
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\nCalculate 2+2 and explain why.<|im_end|>\n"
    "<|im_start|>assistant\n<|thought|>\n"
], return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```
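If you prefer plain Transformers instead of Unsloth, a rough equivalent looks like this (a sketch assuming the checkpoint ships a standard Qwen2 chat template; it requires a CUDA-capable GPU and downloads the model weights):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ikedachin/qwen_finetune_16bit_unsloth_gb10"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Calculate 2+2 and explain why."},
]
# Let the tokenizer build the ChatML prompt instead of writing it by hand
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```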
## Acknowledgements
- This model was trained using the Unsloth library.
- Training was performed on a ThinkStation PGX workstation.
- Built upon the Qwen 2.5 architecture.
## Purpose of this Project
This model is a byproduct of environment verification. I aimed to confirm that the ThinkStation PGX can handle the heavy computation required for GRPO (Group Relative Policy Optimization).
For the detailed verification process and hardware setup, please see my Qiita article (Japanese only).
