SmolLM3-3B-SFT-GRPO-ES

SmolLM3-3B fine-tuned on ReasonXL and further trained with reinforcement learning (Dr. GRPO) for Spanish reasoning adaptation.


Model Description

This model is the result of a two-stage reasoning adaptation pipeline applied to HuggingFaceTB/SmolLM3-3B:

  1. Stage 1 — SFT (DGurgurov/SmolLM3-3B-SFT-ES): shifts the model's reasoning language from English to Spanish via supervised fine-tuning on Spanish reasoning traces.
  2. Stage 2 — RL (Dr. GRPO): recovers reasoning quality lost during SFT while preserving target-language compliance, using a composite reward over verifiable math problems.
| Property | Value |
|---|---|
| Base model | HuggingFaceTB/SmolLM3-3B |
| Training stages | SFT → Dr. GRPO |
| Target language | Spanish (es) |
| SFT data | ReasonXL — ES split (~8.7B tokens) |
| RL data | 20K translated MATH + GSM8K prompts (ES) |

Training Data

Stage 1: SFT

The model is trained on the Spanish split of ReasonXL, a multilingual cross-domain reasoning corpus of 2M samples per language (9B tokens). English source samples were translated using Qwen3-32B with a dedicated prompt preserving technical terminology, mathematical notation, and reasoning structure.

Source datasets included in ReasonXL:

| Dataset | Config | Samples |
|---|---|---|
| Cascade-SFT-Stage-2 | general / math | 768,615 |
| Dolci-Think-SFT-7B | science | 347,453 |
| Cascade-SFT-Stage-1 | general / code / math / science | 711,812 |
| Llama-Nemotron-PTD | science | 267,147 |
| Nemotron-Science-v1 | | 97,026 |
| Nemotron-IF-Chat-v1 | | 91,151 |
| **Total** | | 2,282,204 |

Stage 2: RL

A per-language set of 20K prompts with verifiable answers, drawn from translated subsets of MATH and GSM8K using the same translation pipeline.


Training Details

Stage 1 — SFT

Completion-only loss; user and system tokens are masked. Sequences are chat-formatted and packed to 16,384 tokens.
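The masking scheme can be sketched as follows, assuming pre-tokenized prompt and completion IDs (helper name is illustrative; -100 is the standard PyTorch cross-entropy ignore index):

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy ignore index


def build_labels(prompt_ids: list[int], completion_ids: list[int]) -> tuple[list[int], list[int]]:
    """Completion-only labels: prompt (system + user) tokens are masked out,
    so only assistant completion tokens contribute to the loss."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return input_ids, labels
```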

| Hyperparameter | Value |
|---|---|
| Epochs | 2 |
| Max sequence length | 16,384 |
| Packing | Enabled |
| Precision | bfloat16 |
| Optimizer | adamw_torch_fused |
| Per-device batch size | 4 |
| Gradient accumulation | 4 steps |
| Weight decay | 0.05 |
| LR scheduler | Cosine with min LR |
| Min LR | 5×10⁻⁶ |
| Warmup ratio | 0.05 |
| Distributed strategy | FSDP (8 GPUs) |

Stage 2 — Dr. GRPO

| Hyperparameter | Value |
|---|---|
| Loss type | Dr. GRPO |
| Epochs | 1 |
| Generations per prompt (G) | 8 |
| Max completion length | 8,192 |
| Learning rate | 1×10⁻⁶ |
| LR scheduler | Constant |
| Optimizer | AdamW (β₁=0.9, β₂=0.95) |
| Clipping ε | 0.2 |
| KL coefficient β | 0.0 |
| Per-device batch size | 4 |
| Gradient accumulation | 8 steps |
| Precision | bfloat16 |
| Distributed strategy | FSDP (8 GPUs) |
| Generation backend | vLLM (colocate mode) |
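The key difference from vanilla GRPO is the advantage estimate: Dr. GRPO subtracts the group-mean reward but drops the per-group standard-deviation normalization (and uses a constant rather than per-sequence length normalizer in the loss). A sketch for one group of G completions:

```python
def dr_grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage for each of the G completions of one prompt.

    Dr. GRPO subtracts the group-mean reward but, unlike vanilla GRPO,
    does not divide by the group standard deviation, which removes the
    question-difficulty bias of std-normalized advantages.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```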

Reward Function

The RL stage uses a composite reward (weighted sum):

| Component | Weight (ES) | Description |
|---|---|---|
| Accuracy | 1.0 | Binary correctness of the \boxed{} answer vs. ground truth (symbolic match, string fallback) |
| Language | 0.2 | FastText language-ID score over the reasoning trace (60%) and output (40%) |
| Format | 0.1 | Incremental score for correct <think>...</think> and \boxed{} structure |
| Repetition penalty | 0.3 | Penalizes n-gram loops, token flooding, and character-level repetition; normalized by √T, capped at 1.0 |
| Spanish naturalness | 0.5 | Penalizes unnatural punctuation in the chain-of-thought (see note below) |

Note on Spanish naturalness: We observed that the model tends to produce unnatural uses of inverted question and exclamation marks (e.g., wrapping declarative statements in ¿...?, stacked punctuation like ¿¿ or ¿?). We introduced a dedicated naturalness reward to penalize these patterns. While this reduces their frequency, the behaviour is not fully eliminated — we are actively working on addressing this.
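Putting the table together, the scalar reward for a completion is a weighted sum of the per-component scores. The sketch below assumes the component scorers exist elsewhere and that penalty components return negative values when violations are detected:

```python
# Component weights from the table above; the scorer functions themselves
# (accuracy, language ID, format, etc.) are assumed to exist elsewhere.
WEIGHTS = {
    "accuracy": 1.0,
    "language": 0.2,
    "format": 0.1,
    "repetition": 0.3,   # component value is negative when repetition is detected
    "naturalness": 0.5,  # component value is negative for unnatural punctuation
}


def composite_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-component scores for one completion."""
    return sum(WEIGHTS[name] * value for name, value in components.items())
```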


Inference

The model expects a system prompt instructing it to reason inside <think> tags before answering and to give its final answer in a \boxed{} expression. Example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DGurgurov/SmolLM3-3B-SFT-GRPO-ES"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

messages = [
    {"role": "system", "content": "Eres un asistente útil. Primero piensa en etiquetas <think> y luego da tu respuesta."},
    {"role": "user", "content": "¿Cuánto es 15 × 27?"}
]

# return_dict=True yields input_ids + attention_mask so **inputs unpacks correctly
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16384, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
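The generation interleaves the reasoning trace and the final answer; a small helper (illustrative, not part of the model's API) can separate the two:

```python
import re


def split_reasoning(text: str) -> tuple[str, str]:
    """Separate the <think>...</think> trace from the final answer."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()
```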

Citation

```bibtex
@misc{reasonxl2026,
  title        = {Reason{XL}: A Multilingual Cross-Domain Reasoning Corpus},
  author       = {Daniil Gurgurov and Tom Röhr},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/toroe/Soofi-Think-SFT-10B-multilingual}}
}
```

Paper citation will be added upon publication.
