SmolLM3-3B-SFT-GRPO-ES

SmolLM3-3B fine-tuned on ReasonXL and further trained with reinforcement learning (Dr. GRPO) for Spanish reasoning adaptation.


Model Description

This model is the result of a two-stage reasoning adaptation pipeline applied to HuggingFaceTB/SmolLM3-3B:

  1. Stage 1 — SFT (DGurgurov/SmolLM3-3B-SFT-ES): shifts the model's reasoning language from English to Spanish via supervised fine-tuning on Spanish reasoning traces.
  2. Stage 2 — RL (Dr. GRPO): recovers reasoning quality lost during SFT while preserving target-language compliance, using a composite reward over verifiable math problems.
| Property | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM3-3B |
| Training stages | SFT → Dr. GRPO |
| Target language | Spanish (es) |
| SFT data | ReasonXL — ES split (~8.7B tokens) |
| RL data | 20K translated MATH + GSM8K prompts (ES) |

Training Data

Stage 1: SFT

The model is trained on the Spanish split of ReasonXL, a multilingual cross-domain reasoning corpus of 2M samples per language (9B tokens). English source samples were translated using Qwen3-32B with a dedicated prompt preserving technical terminology, mathematical notation, and reasoning structure.

Source datasets included in ReasonXL:

| Dataset | Config | Samples |
|---|---|---|
| Cascade-SFT-Stage-2 | general / math | 768,615 |
| Dolci-Think-SFT-7B | science | 347,453 |
| Cascade-SFT-Stage-1 | general / code / math / science | 711,812 |
| Llama-Nemotron-PTD | science | 267,147 |
| Nemotron-Science-v1 | | 97,026 |
| Nemotron-IF-Chat-v1 | | 91,151 |
| **Total** | | 2,282,204 |

Stage 2: RL

A per-language set of 20K prompts with verifiable answers, drawn from translated subsets of MATH and GSM8K using the same translation pipeline.


Training Details

Stage 1 — SFT

Completion-only loss; user and system tokens are masked. Sequences are chat-formatted and packed to 16,384 tokens.
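The masking scheme can be sketched as follows, assuming pre-tokenized prompt and completion IDs (helper name is illustrative; -100 is the standard PyTorch cross-entropy ignore index):

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy ignore index


def build_labels(prompt_ids: list[int], completion_ids: list[int]) -> tuple[list[int], list[int]]:
    """Completion-only labels: prompt (system + user) tokens are masked out,
    so only assistant completion tokens contribute to the loss."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return input_ids, labels
```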

| Hyperparameter | Value |
|---|---|
| Epochs | 2 |
| Max sequence length | 16,384 |
| Packing | Enabled |
| Precision | bfloat16 |
| Optimizer | adamw_torch_fused |
| Per-device batch size | 4 |
| Gradient accumulation | 4 steps |
| Weight decay | 0.05 |
| LR scheduler | Cosine with min LR |
| Min LR | 5×10⁻⁶ |
| Warmup ratio | 0.05 |
| Distributed strategy | FSDP (8 GPUs) |

Stage 2 — Dr. GRPO

| Hyperparameter | Value |
|---|---|
| Loss type | Dr. GRPO |
| Epochs | 1 |
| Generations per prompt (G) | 8 |
| Max completion length | 8,192 |
| Learning rate | 1×10⁻⁶ |
| LR scheduler | Constant |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Clipping ε | 0.2 |
| KL coefficient β | 0.0 |
| Per-device batch size | 4 |
| Gradient accumulation | 8 steps |
| Precision | bfloat16 |
| Distributed strategy | FSDP (8 GPUs) |
| Generation backend | vLLM (colocate mode) |
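The key difference from vanilla GRPO is the advantage estimate: Dr. GRPO subtracts the group-mean reward but drops the per-group standard-deviation normalization (and uses a constant rather than per-sequence length normalizer in the loss). A sketch for one group of G completions:

```python
def dr_grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage for each of the G completions of one prompt.

    Dr. GRPO subtracts the group-mean reward but, unlike vanilla GRPO,
    does not divide by the group standard deviation, which removes the
    question-difficulty bias of std-normalized advantages.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```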

Reward Function

The RL stage uses a composite reward (weighted sum):

| Component | Weight (ES) | Description |
|---|---|---|
| Accuracy | 1.0 | Binary correctness of the \boxed{} answer vs. ground truth (symbolic match, string fallback) |
| Language | 0.2 | FastText language-ID score over the reasoning trace (60%) and output (40%) |
| Format | 0.1 | Incremental score for correct <think>...</think> and \boxed{} structure |
| Repetition penalty | 0.3 | Penalizes n-gram loops, token flooding, and character-level repetition; normalized by √T, capped at 1.0 |
| Spanish naturalness | 0.5 | Penalizes unnatural punctuation in the chain-of-thought (see note below) |

Note on Spanish naturalness: We observed that the model tends to produce unnatural uses of inverted question and exclamation marks (e.g., wrapping declarative statements in ¿...?, stacked punctuation like ¿¿ or ¿?). We introduced a dedicated naturalness reward to penalize these patterns. While this reduces their frequency, the behaviour is not fully eliminated — we are actively working on addressing this.
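Putting the table together, the scalar reward for a completion is a weighted sum of the per-component scores. The sketch below assumes the component scorers exist elsewhere and that penalty components return negative values when violations are detected:

```python
# Component weights from the table above; the scorer functions themselves
# (accuracy, language ID, format, etc.) are assumed to exist elsewhere.
WEIGHTS = {
    "accuracy": 1.0,
    "language": 0.2,
    "format": 0.1,
    "repetition": 0.3,   # component value is negative when repetition is detected
    "naturalness": 0.5,  # component value is negative for unnatural punctuation
}


def composite_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-component scores for one completion."""
    return sum(WEIGHTS[name] * value for name, value in components.items())
```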


Inference

The model expects a system prompt instructing it to reason inside <think> tags before answering and to give its final answer in a \boxed{} expression. Example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DGurgurov/SmolLM3-3B-SFT-GRPO-ES"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

messages = [
    {"role": "system", "content": "Eres un asistente útil. Primero piensa en etiquetas <think> y luego da tu respuesta."},
    {"role": "user", "content": "¿Cuánto es 15 × 27?"}
]

# return_dict=True yields input_ids + attention_mask so **inputs unpacks correctly
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16384, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
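The generation interleaves the reasoning trace and the final answer; a small helper (illustrative, not part of the model's API) can separate the two:

```python
import re


def split_reasoning(text: str) -> tuple[str, str]:
    """Separate the <think>...</think> trace from the final answer."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()
```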

Citation

```bibtex
@misc{reasonxl2026,
  title        = {Reason{XL}: A Multilingual Cross-Domain Reasoning Corpus},
  author       = {Daniil Gurgurov and Tom Röhr},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/toroe/Soofi-Think-SFT-10B-multilingual}}
}
```

Paper citation will be added upon publication.
