
Qwen3-0.6B SFT+RL GSM8K Model

This directory contains a Qwen3-0.6B model trained with supervised fine-tuning (SFT) followed by reinforcement learning (RL), optimized for the GSM8K mathematical reasoning task.

Model Information

  • Base Model: Qwen3-0.6B
  • Training Method: SFT + RL (GRPO)
  • Dataset: GSM8K (Grade School Math 8K)
  • Test Set Accuracy: 0.7938 (79.38%)

Directory Structure

Qwen3-0.6B_sft+rl_merged_model/
├── README.md                    # This file
├── config.json                  # Model configuration file
├── generation_config.json       # Generation configuration file
├── tokenizer_config.json        # Tokenizer configuration
├── tokenizer.json               # Tokenizer file
├── vocab.json                   # Vocabulary file
├── merges.txt                   # BPE merges file
├── gsm8k_test_outputs.jsonl     # Test set output results
├── gsm8k_train_outputs.jsonl    # Training set output results
├── evaluate_accuracy.py         # Accuracy evaluation script
├── collect_model_outputs.py     # Model output collection script
└── utils.py                     # Utility functions

Usage

Loading the Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./Qwen3-0.6B_sft+rl_merged_model"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
    model_path,
    trust_remote_code=True,
)

Evaluating Model Accuracy

Use the evaluation script in this directory:

python evaluate_accuracy.py \
    --file gsm8k_test_outputs.jsonl \
    --name "Qwen3-0.6B SFT+RL"
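The script compares the model's predicted final answers against the gold answers. GSM8K marks the final answer with a `#### <number>` line, so the core of the accuracy check might look like the sketch below; the JSONL field names `prediction` and `answer` are assumptions here, so check evaluate_accuracy.py for the actual schema:

```python
import re

ANSWER_RE = re.compile(r"####\s*(-?[\d,]+(?:\.\d+)?)")

def extract_final_answer(text):
    """Pull the last '#### <number>' answer out of a GSM8K-style solution."""
    matches = ANSWER_RE.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")

def accuracy(records):
    """Fraction of records whose predicted final answer matches the gold one.

    Each record is a dict; the 'prediction'/'answer' keys are assumed.
    """
    correct = sum(
        1 for r in records
        if extract_final_answer(r["prediction"]) == extract_final_answer(r["answer"])
    )
    return correct / len(records) if records else 0.0

# Tiny in-memory example in place of reading the JSONL file:
records = [
    {"prediction": "3 + 4 = 7\n#### 7", "answer": "#### 7"},
    {"prediction": "The total is 12.\n#### 12", "answer": "#### 13"},
]
print(accuracy(records))  # 0.5
```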

Performance Metrics

  • Test Set Accuracy: 79.38% (0.7938)
  • Training Set Accuracy: per-sample results are available in gsm8k_train_outputs.jsonl

Related Files

  • Training Scripts: ../train_sft_distillation.py (SFT), ../../train_grpo_gsm8k.py (RL)
  • Merge Script: ../merge_lora_model.py
  • Project README: ../README.md

Notes

  1. The model uses BF16 precision; running it on a GPU that supports BF16 is recommended
  2. The LoRA weights have already been merged into the model, so it can be used directly without loading additional adapters
  3. Test set outputs are saved in gsm8k_test_outputs.jsonl, including the detailed reasoning process for each sample

Citation

If you use this model, please cite the related training methods and datasets:

  • GSM8K Dataset: Cobbe et al., 2021. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168
  • GRPO (Group Relative Policy Optimization): Shao et al., 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300
  • Qwen3: Yang et al., 2025. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388