SLM-RL-Agent β€” Models

Companion model repository for the paper Efficiently Enhancing SLM Agents: A Reinforcement Learning Approach to Performance Improvement.

Code github.com/rezwanh001/slm-rl-agent
Datasets mr3haque/SLM-RL-Agent-Data
License Apache-2.0
Hardware 1 × NVIDIA RTX A6000 (48 GB)

This single repository hosts all 30 trained checkpoints from the SLM-RL-Agent framework: 15 supervised-fine-tuned (SFT) small language models and 15 PPO-aligned (RLHF) small language models, spanning 5 architectures × 3 text corpora.


Repository layout

```
SLM-RL-Agent/
├── sft/                               # 15 LoRA adapters
│   ├── pythia-70m/
│   │   ├── tinystories/               #  (adapter_model.safetensors + tokenizer)
│   │   ├── cnn_dailymail/
│   │   └── wikitext/
│   ├── pythia-160m/     ...
│   ├── pythia-410m/     ...
│   ├── smollm2-135m/    ...
│   └── smollm2-360m/    ...
│
└── ppo/                               # 15 FULL merged models (base + SFT + PPO)
    ├── pythia-70m/
    │   ├── tinystories/               #  (model.safetensors + tokenizer)
    │   ├── cnn_dailymail/
    │   └── wikitext/
    └── ... (same structure for all 5 models)
```

Each SFT directory is a LoRA adapter that sits on top of the corresponding public base model. Each PPO directory is a fully merged model: base weights + SFT LoRA + PPO LoRA collapsed into a single full checkpoint, so no PEFT installation is required to load it.


Main results (15 configurations)

Evaluated on the first 200 prompts of each domain's held-out split (num_samples=200, matching the raw outputs/*/eval_*/evaluation_results.json files shipped with the GitHub repo). Reward comes from the SLM-RL-Agent Bradley–Terry reward model (per-configuration scale). Win rate is the analytical probability that a PPO response scores higher than an SFT response on the same prompt, Φ(Δ / √(σ²_PPO + σ²_SFT)).

| Model | Params | Dataset | SFT PPL ↓ | PPO PPL ↓ | SFT Reward ↑ | PPO Reward ↑ | Δ Reward | Win rate |
|---|---|---|---|---|---|---|---|---|
| Pythia-70M | 70M | tinystories | 51.41 | 51.18 | +6.61 ± 1.63 | +6.53 ± 1.42 | -0.075 | 48.6% |
| Pythia-70M | 70M | cnn_dailymail | 70.29 | 70.54 | +6.22 ± 1.21 | +6.03 ± 1.23 | -0.187 | 45.7% |
| Pythia-70M | 70M | wikitext | 115.08 | 116.66 | +5.81 ± 1.24 | +5.75 ± 1.30 | -0.062 | 48.6% |
| Pythia-160M | 162M | tinystories | 13.48 | 13.50 | -8.52 ± 2.39 | -8.28 ± 2.46 | +0.238 | 52.8% |
| Pythia-160M | 162M | cnn_dailymail | 29.40 | 29.41 | -8.52 ± 1.31 | -8.71 ± 1.19 | -0.198 | 45.6% |
| Pythia-160M | 162M | wikitext | 53.51 | 53.18 | -8.40 ± 2.65 | -8.35 ± 2.34 | +0.043 | 50.5% |
| Pythia-410M | 410M | tinystories | 6.5 | 7.3 | -4.28 ± 4.14 | -2.92 ± 3.48 | +1.355 | 59.9% |
| Pythia-410M | 410M | cnn_dailymail | 16.24 | 17.05 | +1.20 ± 1.76 | +0.94 ± 1.79 | -0.259 | 45.9% |
| Pythia-410M | 410M | wikitext | 25.37 | 27.53 | +1.14 ± 2.89 | +0.10 ± 2.84 | -1.043 | 39.9% |
| SmolLM2-135M | 135M | tinystories | 7.0 | 7.4 | -0.92 ± 2.26 | -0.69 ± 1.96 | +0.226 | 53.0% |
| SmolLM2-135M | 135M | cnn_dailymail | 18.80 | 19.20 | +0.22 ± 1.90 | +0.03 ± 1.90 | -0.194 | 47.1% |
| SmolLM2-135M | 135M | wikitext | 24.40 | 25.10 | -0.44 ± 1.53 | -0.42 ± 1.41 | +0.015 | 50.3% |
| SmolLM2-360M | 361M | tinystories | 5.3 | 5.3 | +1.69 ± 2.25 | +2.41 ± 1.89 | +0.724 | 59.7% |
| SmolLM2-360M | 361M | cnn_dailymail | 12.71 | 12.82 | +2.36 ± 1.09 | +2.36 ± 1.05 | -0.001 | 50.0% |
| SmolLM2-360M | 361M | wikitext | 16.66 | 16.92 | +2.71 ± 1.28 | +2.98 ± 1.06 | +0.272 | 56.5% |
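As a sanity check, the win-rate column can be reproduced from Δ and the two reward standard deviations alone. A minimal sketch of the formula Φ(Δ / √(σ²_PPO + σ²_SFT)), applied to the first row (Pythia-70M / tinystories):

```python
import math

def win_rate(delta, sigma_sft, sigma_ppo):
    """P(PPO response outscores SFT response), assuming independent
    Gaussian reward distributions: Phi(delta / sqrt(s_ppo^2 + s_sft^2))."""
    z = delta / math.sqrt(sigma_ppo ** 2 + sigma_sft ** 2)
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# First row of the table: Pythia-70M / tinystories
print(round(win_rate(delta=-0.075, sigma_sft=1.63, sigma_ppo=1.42) * 100, 1))  # -> 48.6
```

Note the reported Δ carries more precision than the rounded reward means, so the win rate is computed from Δ directly rather than from the difference of the displayed means.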

Key findings.

  • Capacity-headroom hypothesis. The three largest positive reward deltas occur at the two highest-capacity models: Pythia-410M / TinyStories (Δ = +1.36), SmolLM2-360M / TinyStories (Δ = +0.72), and SmolLM2-360M / Wikitext-103 (Δ = +0.27). Models whose SFT baseline is already near-perfect see diminishing returns at this training budget: PPO gain is governed by the gap between a fluent SFT prior and the reward ceiling, not by raw parameter count.
  • No repetition collapse. PPO consistently preserves or improves Distinct-2 diversity over the SFT baseline; for example, SmolLM2-360M / Wikitext goes from Distinct-1 = 0.23 → 0.31 and Distinct-2 = 0.65 → 0.73.
  • Efficiency. Every configuration trains end-to-end (SFT → reward → PPO → eval) in a few GPU-hours on a single RTX A6000.

Comparison vs. published SOTA instruct-tuned SLMs

Each instruct baseline is scored with the same SLM-RL-Agent reward model per dataset. Lower perplexity (PPL) is better; higher reward (R) is better. TS = TinyStories, CNN = CNN/DailyMail, Wiki = Wikitext-103.

| Class | Model | Training regime | TS PPL | TS R | CNN PPL | CNN R | Wiki PPL | Wiki R |
|---|---|---|---|---|---|---|---|---|
| 135M | SmolLM2-135M-Instruct | instruct-tune, 1.7T tok | 8.5 | -0.52 | 19.79 | +0.34 | 34.30 | -0.79 |
| 135M | SmolLM2-135M (ours, SFT) | LoRA, 5 ep, 10K ex | 7.0 | -0.92 | 18.80 | +0.22 | 24.40 | -0.44 |
| 135M | SmolLM2-135M (ours, PPO) | + 250-step PPO RLHF | 7.4 | -0.69 | 19.20 | +0.03 | 25.10 | -0.42 |
| 360M+ | SmolLM2-360M-Instruct | instruct-tune, 1.7T tok | 6.5 | +1.35 | 14.65 | +3.08 | 24.32 | +2.58 |
| 360M+ | Qwen2.5-0.5B-Instruct | instruct-tune, 18T tok | 7.2 | +1.32 | 19.85 | +2.58 | 25.77 | +1.83 |
| 360M+ | SmolLM2-360M (ours, SFT) | LoRA, 5 ep, 10K ex | 5.3 | +1.69 | 12.71 | +2.36 | 16.66 | +2.71 |
| 360M+ | SmolLM2-360M (ours, PPO) | + 250-step PPO RLHF | 5.3 | +2.41 | 12.82 | +2.36 | 16.92 | +2.98 |

Highlights.

  • Our 360M-class SFT beats every instruct baseline on perplexity across every dataset; the largest margin is on Wikitext-103 (16.7 vs. 24.3, a 31% reduction) at a single-GPU, domain-specific training budget.
  • At the 360M class, our PPO checkpoint is the best on TinyStories reward (+2.41 vs. +1.35 for SmolLM2-360M-Instruct and +1.32 for Qwen2.5-0.5B-Instruct) and best on Wikitext-103 reward (+2.98 vs. +2.58 and +1.83).

Usage

Load an SFT LoRA adapter

```python
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Pick one of the 15 (model, dataset) combinations
model_key, dataset = "smollm2-360m", "wikitext"
adapter_dir = snapshot_download(
    repo_id="mr3haque/SLM-RL-Agent",
    allow_patterns=f"sft/{model_key}/{dataset}/**",
)
adapter_path = f"{adapter_dir}/sft/{model_key}/{dataset}"

base  = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
tok   = AutoTokenizer.from_pretrained(adapter_path)
model = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
```
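The snippet above hard-codes one combination. To generalize it, a small lookup can map each folder key to its public base checkpoint; a sketch (the `BASE_MODELS` dict and `adapter_subpath` helper are illustrative, not part of the repo; the Pythia and SmolLM2 repo ids are the standard public ones):

```python
# Hypothetical helper (not shipped with the repo): resolve the public base
# checkpoint for each of the five model keys used in the folder layout.
BASE_MODELS = {
    "pythia-70m":   "EleutherAI/pythia-70m",
    "pythia-160m":  "EleutherAI/pythia-160m",
    "pythia-410m":  "EleutherAI/pythia-410m",
    "smollm2-135m": "HuggingFaceTB/SmolLM2-135M",
    "smollm2-360m": "HuggingFaceTB/SmolLM2-360M",
}

def adapter_subpath(model_key: str, dataset: str) -> str:
    """Repo-relative path of one SFT adapter, mirroring the layout above."""
    if model_key not in BASE_MODELS:
        raise KeyError(f"unknown model key: {model_key}")
    return f"sft/{model_key}/{dataset}"

print(BASE_MODELS["smollm2-360m"], adapter_subpath("smollm2-360m", "wikitext"))
# -> HuggingFaceTB/SmolLM2-360M sft/smollm2-360m/wikitext
```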

Load a PPO model (already merged)

```python
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

model_key, dataset = "smollm2-360m", "wikitext"
ppo_dir = snapshot_download(
    repo_id="mr3haque/SLM-RL-Agent",
    allow_patterns=f"ppo/{model_key}/{dataset}/**",
)
ppo_path = f"{ppo_dir}/ppo/{model_key}/{dataset}"

tok   = AutoTokenizer.from_pretrained(ppo_path)
model = AutoModelForCausalLM.from_pretrained(ppo_path)
```
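To batch-download or batch-evaluate everything, the full checkpoint grid can be enumerated directly from the repository layout; a minimal sketch (the lists below simply restate the layout section):

```python
from itertools import product

# Enumerate all 30 checkpoint subpaths: 2 stages x 5 models x 3 datasets.
STAGES   = ["sft", "ppo"]
MODELS   = ["pythia-70m", "pythia-160m", "pythia-410m",
            "smollm2-135m", "smollm2-360m"]
DATASETS = ["tinystories", "cnn_dailymail", "wikitext"]

paths = [f"{s}/{m}/{d}" for s, m, d in product(STAGES, MODELS, DATASETS)]
print(len(paths))  # -> 30
```

Each element of `paths` can be passed as an `allow_patterns` prefix (with `/**` appended) to `snapshot_download`, as in the snippets above.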

Training recipe (identical for all 15 configurations)

| Stage | Library | Key hyperparameters |
|---|---|---|
| SFT | HuggingFace Trainer + PEFT | LoRA r=16, α=32, 3 epochs, bs 8×4, LR 2e-4, bf16 |
| Reward model | HuggingFace Trainer | Bradley–Terry pairwise loss, 1 epoch, LR 1e-5 |
| PPO | TRL 0.9.x | 250 steps, LR 5e-6, KL 0.05–0.2, score clip ±3σ, float32, weight rollback |
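The SFT row of the table can be expressed as PEFT/Transformers config objects. A sketch under stated assumptions: we read "bs 8×4" as per-device batch 8 with 4 gradient-accumulation steps, and `target_modules` is illustrative (the authoritative values live in the GitHub repo):

```python
from peft import LoraConfig
from transformers import TrainingArguments

# SFT stage, mirroring the table: LoRA r=16, alpha=32, 3 epochs, LR 2e-4, bf16.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumption, not from the paper
    task_type="CAUSAL_LM",
)
sft_args = TrainingArguments(
    output_dir="sft-out",
    num_train_epochs=3,
    per_device_train_batch_size=8,      # "bs 8x4" read as 8 per device ...
    gradient_accumulation_steps=4,      # ... times 4 accumulation steps
    learning_rate=2e-4,
    bf16=True,
)
```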

Three engineering fixes unique to the SLM regime, all implemented in scripts/train_ppo.py:

  1. Merge-and-reinitialize for PEFT+PPO. TRL ≤ 0.9.x silently freezes LoRA parameters when the policy is a PEFT adapter. We merge the SFT adapter into the base weights, then attach a fresh LoRA on top before PPO.
  2. Float32 throughout. Bfloat16 causes probability-ratio explosions (> 10⁶) within the first PPO batch for models < 200M parameters. Float32 is required.
  3. Reward whitening + weight rollback. Score-clipping at ±3σ and a per-step weight-rollback mechanism that reverts to the last healthy snapshot on NaN/Inf eliminate catastrophic collapse across all 15 runs.
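The clipping-and-rollback logic in fix 3 can be sketched framework-free (illustrative only; the real implementation in scripts/train_ppo.py operates on model state dicts):

```python
import copy
import math

def clip_scores(scores, n_sigma=3.0):
    """Whiten rewards, then clip outliers at +/- n_sigma (fix 3, first half)."""
    mean = sum(scores) / len(scores)
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores)) or 1.0
    return [max(-n_sigma, min(n_sigma, (s - mean) / std)) for s in scores]

def step_with_rollback(weights, snapshot, update):
    """Apply one PPO update; revert to the last healthy snapshot on NaN/Inf
    (fix 3, second half). `update` maps a weight dict to a new weight dict."""
    new_weights = update(weights)
    if any(math.isnan(w) or math.isinf(w) for w in new_weights.values()):
        return copy.deepcopy(snapshot), snapshot          # roll back
    return new_weights, copy.deepcopy(new_weights)        # advance snapshot

# A NaN-producing update is reverted to the last healthy weights.
w = snap = {"w0": 0.1}
w, snap = step_with_rollback(w, snap, lambda wt: {"w0": float("nan")})
print(w)  # -> {'w0': 0.1}
```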

Citation

```bibtex
@misc{haque2026slmrlagent,
  title  = {Efficiently Enhancing SLM Agents: A Reinforcement Learning Approach to Performance Improvement},
  author = {Haque, Md. Rezwanul},
  year   = {2026},
  howpublished = {\url{https://github.com/rezwanh001/slm-rl-agent}},
  note   = {University of Waterloo, CPAMI Lab}
}
```