# SLM-RL-Agent Models
Companion model repository for the paper *Efficiently Enhancing SLM Agents: A Reinforcement Learning Approach to Performance Improvement*.
| Resource | Location |
|---|---|
| Code | github.com/rezwanh001/slm-rl-agent |
| Datasets | mr3haque/SLM-RL-Agent-Data |
| License | Apache-2.0 |
| Hardware | 1 × NVIDIA RTX A6000 (48 GB) |
This single repository hosts all 30 trained checkpoints from the SLM-RL-Agent framework: 15 supervised-fine-tuned (SFT) small language models and 15 PPO-aligned (RLHF) small language models, spanning 5 architectures × 3 text corpora.
## Repository layout
```
SLM-RL-Agent/
├── sft/                      # 15 LoRA adapters
│   ├── pythia-70m/
│   │   ├── tinystories/      # (adapter_model.safetensors + tokenizer)
│   │   ├── cnn_dailymail/
│   │   └── wikitext/
│   ├── pythia-160m/ ...
│   ├── pythia-410m/ ...
│   ├── smollm2-135m/ ...
│   └── smollm2-360m/ ...
│
└── ppo/                      # 15 FULL merged models (base + SFT + PPO)
    ├── pythia-70m/
    │   ├── tinystories/      # (model.safetensors + tokenizer)
    │   ├── cnn_dailymail/
    │   └── wikitext/
    └── ... (same structure for all 5 models)
```
Each SFT directory is a LoRA adapter that sits on top of the corresponding public base model. Each PPO directory is a fully merged model that already contains the base weights + SFT LoRA + PPO LoRA collapsed into a single full checkpoint; no PEFT installation is required to load it.
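Concretely, "collapsed" means the low-rank update is folded into the dense weight, W' = W + (α/r)·B·A per adapted layer, which is what PEFT's `merge_and_unload()` performs. A toy stdlib sketch of that arithmetic (illustrative only, not the repo's code):

```python
def matmul(B, A):
    # Plain-list matrix product: (m x r) @ (r x n) -> (m x n).
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def merge_lora(W, A, B, alpha, r):
    # W' = W + (alpha / r) * B @ A: the LoRA update folded into the base weight.
    BA = matmul(B, A)
    return [[W[i][j] + (alpha / r) * BA[i][j]
             for j in range(len(W[0]))] for i in range(len(W))]

# 2x2 base weight, rank-1 adapter with alpha = 2:
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]   # (2 x 1)
A = [[0.0, 1.0]]     # (1 x 2)
merged = merge_lora(W, A, B, alpha=2, r=1)  # -> [[1.0, 2.0], [0.0, 1.0]]
```

After this fold, the adapter matrices can be discarded and the checkpoint loads as an ordinary dense model, which is exactly why the `ppo/` directories need no PEFT.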
## Main results: 15 configurations
Evaluated on the first 200 prompts of each domain's held-out split (`num_samples=200`, matching the raw `outputs/*/eval_*/evaluation_results.json` files shipped with the GitHub repo). Reward comes from the SLM-RL-Agent Bradley-Terry reward model (per-configuration scale). Win rate is the analytical probability that a PPO response scores higher than an SFT response on the same prompt, Φ(Δ / √(σ²_PPO + σ²_SFT)).
| Model | Params (M) | Dataset | SFT PPL ↓ | PPO PPL ↓ | SFT Reward ↑ | PPO Reward ↑ | Δ Reward | Win Rate |
|---|---|---|---|---|---|---|---|---|
| Pythia 70M | 70 | tinystories | 51.41 | 51.18 | +6.61 ± 1.63 | +6.53 ± 1.42 | -0.075 | 48.6% |
| Pythia 70M | 70 | cnn_dailymail | 70.29 | 70.54 | +6.22 ± 1.21 | +6.03 ± 1.23 | -0.187 | 45.7% |
| Pythia 70M | 70 | wikitext | 115.08 | 116.66 | +5.81 ± 1.24 | +5.75 ± 1.30 | -0.062 | 48.6% |
| Pythia 162M | 162 | tinystories | 13.48 | 13.50 | -8.52 ± 2.39 | -8.28 ± 2.46 | +0.238 | 52.8% |
| Pythia 162M | 162 | cnn_dailymail | 29.40 | 29.41 | -8.52 ± 1.31 | -8.71 ± 1.19 | -0.198 | 45.6% |
| Pythia 162M | 162 | wikitext | 53.51 | 53.18 | -8.40 ± 2.65 | -8.35 ± 2.34 | +0.043 | 50.5% |
| Pythia 410M | 410 | tinystories | 6.5 | 7.3 | -4.28 ± 4.14 | -2.92 ± 3.48 | +1.355 | 59.9% |
| Pythia 410M | 410 | cnn_dailymail | 16.24 | 17.05 | +1.20 ± 1.76 | +0.94 ± 1.79 | -0.259 | 45.9% |
| Pythia 410M | 410 | wikitext | 25.37 | 27.53 | +1.14 ± 2.89 | +0.10 ± 2.84 | -1.043 | 39.9% |
| SmolLM2 135M | 135 | tinystories | 7.0 | 7.4 | -0.92 ± 2.26 | -0.69 ± 1.96 | +0.226 | 53.0% |
| SmolLM2 135M | 135 | cnn_dailymail | 18.80 | 19.20 | +0.22 ± 1.90 | +0.03 ± 1.90 | -0.194 | 47.1% |
| SmolLM2 135M | 135 | wikitext | 24.40 | 25.10 | -0.44 ± 1.53 | -0.42 ± 1.41 | +0.015 | 50.3% |
| SmolLM2 361M | 361 | tinystories | 5.3 | 5.3 | +1.69 ± 2.25 | +2.41 ± 1.89 | +0.724 | 59.7% |
| SmolLM2 361M | 361 | cnn_dailymail | 12.71 | 12.82 | +2.36 ± 1.09 | +2.36 ± 1.05 | -0.001 | 50.0% |
| SmolLM2 361M | 361 | wikitext | 16.66 | 16.92 | +2.71 ± 1.28 | +2.98 ± 1.06 | +0.272 | 56.5% |
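As a sanity check, the win-rate column can be reproduced from the Δ and σ values using the formula above; e.g. for the Pythia-410M / TinyStories row:

```python
import math

def win_rate(delta, sigma_ppo, sigma_sft):
    # Phi(delta / sqrt(sigma_PPO^2 + sigma_SFT^2)), with Phi the standard
    # normal CDF, expressed via the error function.
    z = delta / math.sqrt(sigma_ppo**2 + sigma_sft**2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Pythia-410M / TinyStories: delta = +1.355, sigma_PPO = 3.48, sigma_SFT = 4.14
win_rate(1.355, 3.48, 4.14)  # ~0.599, matching the 59.9% in the table
```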
**Key findings.**
- **Capacity-headroom hypothesis.** The three largest positive reward deltas occur at the two highest-capacity models: Pythia-410M / TinyStories (Δ = +1.36), SmolLM2-360M / TinyStories (Δ = +0.72), and SmolLM2-360M / Wikitext-103 (Δ = +0.27). Models whose SFT baseline is already near-perfect see diminishing returns at this training budget: PPO gain is governed by the gap between a fluent SFT prior and the reward ceiling, not by raw parameter count.
- **No repetition collapse.** PPO consistently preserves or improves Distinct-2 diversity over the SFT baseline; e.g. SmolLM2-360M / Wikitext goes from Distinct-1 = 0.23 → 0.31 and Distinct-2 = 0.65 → 0.73.
- **Efficiency.** Every configuration trains end-to-end (SFT → reward → PPO → eval) in a few GPU-hours on a single RTX A6000.
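Distinct-n in the second bullet is the standard diversity metric: the fraction of n-grams in the generated text that are unique. A minimal sketch, assuming whitespace tokenization:

```python
def distinct_n(tokens, n):
    # Unique n-grams divided by total n-grams; higher = more diverse output.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

toks = "the cat sat on the mat".split()
distinct_n(toks, 1)  # 5 unique unigrams / 6 total -> 0.833...
distinct_n(toks, 2)  # all 5 bigrams unique -> 1.0
```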
## Comparison vs. published SOTA instruct-tuned SLMs
Each instruct baseline is scored with the same SLM-RL-Agent reward model per dataset. Lower perplexity (PPL) is better; higher reward (R) is better. TS = TinyStories, CNN = CNN/DailyMail, Wiki = Wikitext-103.
| Class | Model | Training regime | TS PPL | TS R | CNN PPL | CNN R | Wiki PPL | Wiki R |
|---|---|---|---|---|---|---|---|---|
| 135M | SmolLM2-135M-Instruct | instruct-tune 1.7T tok | 8.5 | -0.52 | 19.79 | +0.34 | 34.30 | -0.79 |
| 135M | SmolLM2-135M (ours, SFT) | LoRA, 5 ep, 10K ex | 7.0 | -0.92 | 18.80 | +0.22 | 24.40 | -0.44 |
| 135M | SmolLM2-135M (ours, PPO) | + 250-step PPO RLHF | 7.4 | -0.69 | 19.20 | +0.03 | 25.10 | -0.42 |
| 360M+ | SmolLM2-360M-Instruct | instruct-tune 1.7T tok | 6.5 | +1.35 | 14.65 | +3.08 | 24.32 | +2.58 |
| 360M+ | Qwen2.5-0.5B-Instruct | instruct-tune 18T tok | 7.2 | +1.32 | 19.85 | +2.58 | 25.77 | +1.83 |
| 360M+ | SmolLM2-360M (ours, SFT) | LoRA, 5 ep, 10K ex | 5.3 | +1.69 | 12.71 | +2.36 | 16.66 | +2.71 |
| 360M+ | SmolLM2-360M (ours, PPO) | + 250-step PPO RLHF | 5.3 | +2.41 | 12.82 | +2.36 | 16.92 | +2.98 |
**Highlights.**
- Our 360M-class SFT beats every instruct baseline on perplexity across every dataset; the largest margin is on Wikitext-103 (16.7 vs. 24.3, a 31% reduction) at a single-GPU, domain-specific training budget.
- At the 360M class, our PPO checkpoint is the best on TinyStories reward (+2.41 vs. +1.35 for SmolLM2-360M-Instruct and +1.32 for Qwen2.5-0.5B-Instruct) and best on Wikitext-103 reward (+2.98 vs. +2.58 and +1.83).
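For reference, the PPL columns in both tables are the exponential of the mean per-token negative log-likelihood on the held-out split; a minimal sketch, assuming per-token NLLs are already computed:

```python
import math

def perplexity(token_nlls):
    # PPL = exp(mean negative log-likelihood per token); lower is better.
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model exactly uniform over 4 candidate tokens has NLL ln(4) everywhere:
perplexity([math.log(4.0)] * 10)  # -> 4.0
```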
## Usage
### Load an SFT LoRA adapter
```python
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Pick one of the 15 (model, dataset) combinations
model_key, dataset = "smollm2-360m", "wikitext"

adapter_dir = snapshot_download(
    repo_id="mr3haque/SLM-RL-Agent",
    allow_patterns=f"sft/{model_key}/{dataset}/**",
)
adapter_path = f"{adapter_dir}/sft/{model_key}/{dataset}"

base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
tok = AutoTokenizer.from_pretrained(adapter_path)
model = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
```
### Load a PPO model (already merged)
```python
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

model_key, dataset = "smollm2-360m", "wikitext"

ppo_dir = snapshot_download(
    repo_id="mr3haque/SLM-RL-Agent",
    allow_patterns=f"ppo/{model_key}/{dataset}/**",
)
ppo_path = f"{ppo_dir}/ppo/{model_key}/{dataset}"

tok = AutoTokenizer.from_pretrained(ppo_path)
model = AutoModelForCausalLM.from_pretrained(ppo_path)
```
## Training recipe (identical for all 15 configurations)
| Stage | Library | Key hyperparameters |
|---|---|---|
| SFT | HuggingFace Trainer + PEFT | LoRA r=16, α=32, 3 epochs, bs 8×4, LR 2e-4, bf16 |
| Reward model | HuggingFace Trainer | Bradley-Terry pairwise loss, 1 epoch, LR 1e-5 |
| PPO | TRL 0.9.x | 250 steps, LR 5e-6, KL 0.05-0.2, score clip ±3σ, float32, weight rollback |
Three engineering fixes unique to the SLM regime, all implemented in `scripts/train_ppo.py`:
- **Merge-and-reinitialize for PEFT+PPO.** TRL ≤ 0.9.x silently freezes LoRA parameters when the policy is a PEFT adapter. We merge the SFT adapter into the base weights, then attach a fresh LoRA on top before PPO.
- **Float32 throughout.** Bfloat16 causes probability-ratio explosions (> 10⁶) within the first PPO batch for models < 200M parameters. Float32 is required.
- **Reward whitening + weight rollback.** Score-clipping at ±3σ and a per-step weight-rollback mechanism that reverts to the last healthy snapshot on NaN/Inf eliminate catastrophic collapse across all 15 runs.
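The whitening/clipping/rollback logic in the last fix can be sketched in a few lines (an illustrative stdlib sketch, not the exact code in `scripts/train_ppo.py`):

```python
import math
import statistics

def whiten_and_clip(scores, clip_sigma=3.0):
    # Normalize rewards to zero mean / unit variance, then clip outliers
    # at +/- clip_sigma so a single extreme score cannot dominate a batch.
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores) or 1e-8
    return [max(-clip_sigma, min(clip_sigma, (s - mean) / std)) for s in scores]

def guard_update(new_params, snapshot):
    # Weight rollback: accept the update only if every weight is finite,
    # otherwise revert to the last healthy snapshot.
    if all(math.isfinite(p) for p in new_params):
        return new_params
    return snapshot

clipped = whiten_and_clip([0.0] * 20 + [100.0])       # extreme outlier capped at 3.0
safe = guard_update([1.0, float("nan")], [1.0, 2.0])  # -> [1.0, 2.0] (rolled back)
```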
## Citation
```bibtex
@misc{haque2026slmrlagent,
  title        = {Efficiently Enhancing SLM Agents: A Reinforcement Learning Approach to Performance Improvement},
  author       = {Haque, Md. Rezwanul},
  year         = {2026},
  howpublished = {\url{https://github.com/rezwanh001/slm-rl-agent}},
  note         = {University of Waterloo, CPAMI Lab}
}
```