# SmolLM3-3B-SFT-GRPO-ES
SmolLM3-3B fine-tuned on ReasonXL and further trained with reinforcement learning (Dr. GRPO) for Spanish reasoning adaptation.
## Model Description
This model is the result of a two-stage reasoning adaptation pipeline applied to HuggingFaceTB/SmolLM3-3B:
- **Stage 1 — SFT** (`DGurgurov/SmolLM3-3B-SFT-ES`): shifts the model's reasoning language from English to Spanish via supervised fine-tuning on Spanish reasoning traces.
- **Stage 2 — RL (Dr. GRPO)**: recovers reasoning quality lost during SFT while preserving target-language compliance, using a composite reward over verifiable math problems.
| Property | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM3-3B |
| Training stages | SFT → Dr. GRPO |
| Target language | Spanish (es) |
| SFT data | ReasonXL — ES split (~8.7B tokens) |
| RL data | 20K translated MATH + GSM8K prompts (ES) |
## Training Data

### Stage 1: SFT
The model is trained on the Spanish split of ReasonXL, a multilingual cross-domain reasoning corpus with 2M samples (~9B tokens) per language. English source samples were translated with Qwen3-32B using a dedicated prompt that preserves technical terminology, mathematical notation, and reasoning structure.
Source datasets included in ReasonXL:
| Dataset | Config | Samples |
|---|---|---|
| Cascade-SFT-Stage-2 | general / math | 768,615 |
| Dolci-Think-SFT-7B | science | 347,453 |
| Cascade-SFT-Stage-1 | general / code / math / science | 711,812 |
| Llama-Nemotron-PTD | science | 267,147 |
| Nemotron-Science-v1 | — | 97,026 |
| Nemotron-IF-Chat-v1 | — | 91,151 |
| **Total** | — | 2,282,204 |
### Stage 2: RL

RL training uses a per-language set of 20K prompts with verifiable answers, drawn from translated subsets of MATH and GSM8K produced with the same translation pipeline as Stage 1.
## Training Details

### Stage 1 — SFT
Completion-only loss; user and system tokens are masked. Sequences are chat-formatted and packed to 16,384 tokens.
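The completion-only objective can be sketched as follows. This is a minimal illustration, not the actual training code: token ids and span boundaries are hypothetical, and frameworks such as TRL handle this masking internally.

```python
# Sketch of completion-only loss masking: tokens from system/user turns get
# label -100 (the standard PyTorch ignore index) so they contribute no loss.
IGNORE_INDEX = -100

def mask_prompt_tokens(input_ids, assistant_spans):
    """Return labels where only assistant-turn tokens contribute to the loss.

    assistant_spans: list of (start, end) index pairs covering assistant replies.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in assistant_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# Example: a 10-token chat-formatted sequence where tokens 6..9 are the completion.
ids = [101, 5, 6, 7, 8, 9, 42, 43, 44, 102]
labels = mask_prompt_tokens(ids, [(6, 10)])
print(labels)  # → [-100, -100, -100, -100, -100, -100, 42, 43, 44, 102]
```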
| Hyperparameter | Value |
|---|---|
| Epochs | 2 |
| Max sequence length | 16,384 |
| Packing | Enabled |
| Precision | bfloat16 |
| Optimizer | adamw_torch_fused |
| Per-device batch size | 4 |
| Gradient accumulation | 4 steps |
| Weight decay | 0.05 |
| LR scheduler | Cosine with min LR |
| Min LR | 5×10⁻⁶ |
| Warmup ratio | 0.05 |
| Distributed strategy | FSDP (8 GPUs) |
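The "cosine with min LR" schedule in the table can be sketched as below: linear warmup to the peak rate, then cosine decay down to the floor of 5×10⁻⁶ rather than to zero. The peak learning rate is not stated in the card, so it is a free parameter here.

```python
import math

def lr_at(step, total_steps, peak_lr, min_lr, warmup_ratio=0.05):
    """Cosine schedule with a minimum LR floor (illustrative sketch).

    Linear warmup over warmup_ratio * total_steps, then cosine decay
    from peak_lr down to min_lr instead of to zero.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 → 0 over training
    return min_lr + (peak_lr - min_lr) * cosine

print(lr_at(0, 1000, 3e-5, 5e-6))     # start of warmup → 0.0
print(lr_at(1000, 1000, 3e-5, 5e-6))  # end of training → 5e-06 (the floor)
```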
### Stage 2 — Dr. GRPO
| Hyperparameter | Value |
|---|---|
| Loss type | Dr. GRPO |
| Epochs | 1 |
| Generations per prompt (G) | 8 |
| Max completion length | 8,192 |
| Learning rate | 1×10⁻⁶ |
| LR scheduler | Constant |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Clipping ε | 0.2 |
| KL coefficient β | 0.0 |
| Per-device batch size | 4 |
| Gradient accumulation | 8 steps |
| Precision | bfloat16 |
| Distributed strategy | FSDP (8 GPUs) |
| Generation backend | vLLM (colocate mode) |
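Following the Dr. GRPO formulation (group-mean baseline with no standard-deviation normalization of advantages and no per-sequence length normalization in the loss), the advantage computation for a group of sampled completions can be sketched as below. This is the published formulation, not the card's training code.

```python
def drgrpo_advantages(rewards):
    """Group-relative advantages as in Dr. GRPO (sketch).

    Unlike vanilla GRPO, rewards are centered by the group mean only;
    there is no division by the group standard deviation.
    """
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]

# G = 8 completions sampled for one prompt, each scored by the composite reward.
rewards = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
print(drgrpo_advantages(rewards))  # → [0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5, -0.5]
```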
## Reward Function
The RL stage uses a composite reward (weighted sum):
| Component | Weight (ES) | Description |
|---|---|---|
| Accuracy | 1.0 | Binary correctness of \boxed{} answer vs. ground truth (symbolic match, string fallback) |
| Language | 0.2 | FastText language ID score over reasoning trace (60%) and output (40%) |
| Format | 0.1 | Incremental score for correct <think>...</think> and \boxed{} structure |
| Repetition penalty | 0.3 | Penalizes n-gram loops, token flooding, and character-level repetition; normalized by √T, capped at 1.0 |
| Spanish naturalness | 0.5 | Penalizes unnatural punctuation in chain-of-thought (see note below) |
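The weighted sum can be sketched as follows. The weights come from the table above; everything else is an assumption — the real scorers (FastText language ID, symbolic answer matching, the repetition and naturalness checks) are not released, the stub scores and sign conventions for penalties are illustrative, and `extract_boxed` shows only the string-fallback path of answer checking.

```python
import re

# Component weights from the table above (ES configuration).
WEIGHTS = {"accuracy": 1.0, "language": 0.2, "format": 0.1,
           "repetition": 0.3, "naturalness": 0.5}

def extract_boxed(text):
    """Pull the last \\boxed{...} answer from a completion (string match only)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

def composite_reward(scores):
    """Weighted sum of per-component scores (penalty components assumed
    to report 0 when no violation occurs, negative otherwise)."""
    return sum(WEIGHTS[name] * score for name, score in scores.items())

completion = "<think>15 × 27 = 405</think> La respuesta es \\boxed{405}."
assert extract_boxed(completion) == "405"
print(composite_reward({"accuracy": 1.0, "language": 0.9, "format": 1.0,
                        "repetition": 0.0, "naturalness": 1.0}))  # → 1.78
```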
**Note on Spanish naturalness:** We observed that the model tends to produce unnatural uses of inverted question and exclamation marks (e.g., wrapping declarative statements in `¿...?`, or stacked punctuation like `¿¿` or `¿?`). We introduced a dedicated naturalness reward to penalize these patterns. While this reduces their frequency, the behaviour is not fully eliminated; we are actively working on addressing this.
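A detector for the stacked-punctuation patterns described above could look like the sketch below. The actual naturalness scorer is not released; the regex patterns and the counting scheme are assumptions for illustration only (and do not cover the harder case of `¿...?` wrapped around declarative statements).

```python
import re

# Hypothetical patterns for stacked or mismatched inverted punctuation:
# doubled opening marks (¿¿, ¡¡) or an opener immediately closed (¿?, ¡!).
STACKED = re.compile(r"¿¿|¡¡|¿\?|¡!")

def naturalness_penalty(text):
    """Count occurrences of stacked/mismatched inverted punctuation."""
    return len(STACKED.findall(text))

print(naturalness_penalty("¿¿Es esto correcto??"))  # → 1
print(naturalness_penalty("¿Cuánto es 15 × 27?"))   # → 0
```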
## Inference

The model expects a system prompt instructing it to reason inside `<think>` tags before answering, with the final answer given in a `\boxed{}` expression. Example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DGurgurov/SmolLM3-3B-SFT-GRPO-ES"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

messages = [
    {"role": "system", "content": "Eres un asistente útil. Primero piensa en etiquetas <think> y luego da tu respuesta."},
    {"role": "user", "content": "¿Cuánto es 15 × 27?"},
]

# return_dict=True makes apply_chat_template return a mapping that can be
# unpacked into generate(); add_generation_prompt appends the assistant turn.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16384, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
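Once generated, the reasoning trace and final answer can be separated with a small helper. This parser is shown for illustration; its name and behaviour are not part of the model card.

```python
import re

def parse_response(text):
    """Split a generation into its <think> trace and final \\boxed{} answer."""
    think = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return (think.group(1).strip() if think else None,
            boxed[-1] if boxed else None)

trace, answer = parse_response("<think>15 × 27 = 405</think> \\boxed{405}")
print(trace)   # → 15 × 27 = 405
print(answer)  # → 405
```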
## Citation

```bibtex
@misc{reasonxl2026,
  title = {Reason{XL}: A Multilingual Cross-Domain Reasoning Corpus},
  author = {Daniil Gurgurov and Tom Röhr},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/toroe/Soofi-Think-SFT-10B-multilingual}}
}
```
Paper citation will be added upon publication.