# SLM-RL-Agent Models
Companion model repository for the paper *Efficiently Enhancing SLM Agents: A Reinforcement Learning Approach to Performance Improvement*.
| Resource | Location |
|---|---|
| Code | github.com/rezwanh001/slm-rl-agent |
| Datasets | mr3haque/SLM-RL-Agent-Data |
| License | Apache-2.0 |
| Hardware | 1 × NVIDIA RTX A6000 (48 GB) |
This single repository hosts all 30 trained checkpoints from the SLM-RL-Agent framework: 15 supervised-fine-tuned (SFT) small language models and 15 PPO-aligned (RLHF) small language models, spanning 5 architectures × 3 text corpora.
## Repository layout
```
SLM-RL-Agent/
├── sft/                      # 15 LoRA adapters
│   ├── pythia-70m/
│   │   ├── tinystories/      # (adapter_model.safetensors + tokenizer)
│   │   ├── cnn_dailymail/
│   │   └── wikitext/
│   ├── pythia-160m/ ...
│   ├── pythia-410m/ ...
│   ├── smollm2-135m/ ...
│   └── smollm2-360m/ ...
│
└── ppo/                      # 15 FULL merged models (base + SFT + PPO)
    ├── pythia-70m/
    │   ├── tinystories/      # (model.safetensors + tokenizer)
    │   ├── cnn_dailymail/
    │   └── wikitext/
    └── ... (same structure for all 5 models)
```
Each SFT directory is a LoRA adapter that sits on top of the corresponding public base model. Each PPO directory is a fully merged model that already contains the base weights + SFT LoRA + PPO LoRA collapsed into a single full checkpoint; no PEFT installation is required to load it.
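Concretely, "collapsed" means the low-rank update is folded into the dense weight, W' = W + (α/r)·B·A per adapted layer, which is what PEFT's `merge_and_unload()` performs. A toy stdlib sketch of that arithmetic (illustrative only, not the repo's code):

```python
def matmul(B, A):
    # Plain-list matrix product: (m x r) @ (r x n) -> (m x n).
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def merge_lora(W, A, B, alpha, r):
    # W' = W + (alpha / r) * B @ A: the LoRA update folded into the base weight.
    BA = matmul(B, A)
    return [[W[i][j] + (alpha / r) * BA[i][j]
             for j in range(len(W[0]))] for i in range(len(W))]

# 2x2 base weight, rank-1 adapter with alpha = 2:
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]   # (2 x 1)
A = [[0.0, 1.0]]     # (1 x 2)
merged = merge_lora(W, A, B, alpha=2, r=1)  # -> [[1.0, 2.0], [0.0, 1.0]]
```

After this fold, the adapter matrices can be discarded and the checkpoint loads as an ordinary dense model, which is exactly why the `ppo/` directories need no PEFT.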
## Main results: 15 configurations
Evaluated on the first 200 prompts of each domain's held-out split (`num_samples=200`, matching the raw `outputs/*/eval_*/evaluation_results.json` files shipped with the GitHub repo). Reward comes from the SLM-RL-Agent Bradley-Terry reward model (per-configuration scale). Win rate is the analytical probability that a PPO response scores higher than an SFT response on the same prompt, Φ(Δ / √(σ²_PPO + σ²_SFT)).
| Model | Params (M) | Dataset | SFT PPL ↓ | PPO PPL ↓ | SFT Reward ↑ | PPO Reward ↑ | Δ Reward | Win Rate |
|---|---|---|---|---|---|---|---|---|
| Pythia 70M | 70 | tinystories | 51.41 | 51.18 | +6.61 ± 1.63 | +6.53 ± 1.42 | -0.075 | 48.6% |
| Pythia 70M | 70 | cnn_dailymail | 70.29 | 70.54 | +6.22 ± 1.21 | +6.03 ± 1.23 | -0.187 | 45.7% |
| Pythia 70M | 70 | wikitext | 115.08 | 116.66 | +5.81 ± 1.24 | +5.75 ± 1.30 | -0.062 | 48.6% |
| Pythia 162M | 162 | tinystories | 13.48 | 13.50 | -8.52 ± 2.39 | -8.28 ± 2.46 | +0.238 | 52.8% |
| Pythia 162M | 162 | cnn_dailymail | 29.40 | 29.41 | -8.52 ± 1.31 | -8.71 ± 1.19 | -0.198 | 45.6% |
| Pythia 162M | 162 | wikitext | 53.51 | 53.18 | -8.40 ± 2.65 | -8.35 ± 2.34 | +0.043 | 50.5% |
| Pythia 410M | 410 | tinystories | 6.5 | 7.3 | -4.28 ± 4.14 | -2.92 ± 3.48 | +1.355 | 59.9% |
| Pythia 410M | 410 | cnn_dailymail | 16.24 | 17.05 | +1.20 ± 1.76 | +0.94 ± 1.79 | -0.259 | 45.9% |
| Pythia 410M | 410 | wikitext | 25.37 | 27.53 | +1.14 ± 2.89 | +0.10 ± 2.84 | -1.043 | 39.9% |
| SmolLM2 135M | 135 | tinystories | 7.0 | 7.4 | -0.92 ± 2.26 | -0.69 ± 1.96 | +0.226 | 53.0% |
| SmolLM2 135M | 135 | cnn_dailymail | 18.80 | 19.20 | +0.22 ± 1.90 | +0.03 ± 1.90 | -0.194 | 47.1% |
| SmolLM2 135M | 135 | wikitext | 24.40 | 25.10 | -0.44 ± 1.53 | -0.42 ± 1.41 | +0.015 | 50.3% |
| SmolLM2 361M | 361 | tinystories | 5.3 | 5.3 | +1.69 ± 2.25 | +2.41 ± 1.89 | +0.724 | 59.7% |
| SmolLM2 361M | 361 | cnn_dailymail | 12.71 | 12.82 | +2.36 ± 1.09 | +2.36 ± 1.05 | -0.001 | 50.0% |
| SmolLM2 361M | 361 | wikitext | 16.66 | 16.92 | +2.71 ± 1.28 | +2.98 ± 1.06 | +0.272 | 56.5% |
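As a sanity check, the win-rate column can be reproduced from the Δ and σ values using the formula above; e.g. for the Pythia-410M / TinyStories row:

```python
import math

def win_rate(delta, sigma_ppo, sigma_sft):
    # Phi(delta / sqrt(sigma_PPO^2 + sigma_SFT^2)), with Phi the standard
    # normal CDF, expressed via the error function.
    z = delta / math.sqrt(sigma_ppo**2 + sigma_sft**2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Pythia-410M / TinyStories: delta = +1.355, sigma_PPO = 3.48, sigma_SFT = 4.14
win_rate(1.355, 3.48, 4.14)  # ~0.599, matching the 59.9% in the table
```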
**Key findings.**
- **Capacity-headroom hypothesis.** The three largest positive reward deltas occur at the two highest-capacity models: Pythia-410M / TinyStories (Δ = +1.36), SmolLM2-360M / TinyStories (Δ = +0.72), and SmolLM2-360M / Wikitext-103 (Δ = +0.27). Models whose SFT baseline is already near-perfect see diminishing returns at this training budget: PPO gain is governed by the gap between a fluent SFT prior and the reward ceiling, not by raw parameter count.
- **No repetition collapse.** PPO consistently preserves or improves Distinct-2 diversity over the SFT baseline; e.g. SmolLM2-360M / Wikitext goes from Distinct-1 = 0.23 → 0.31 and Distinct-2 = 0.65 → 0.73.
- **Efficiency.** Every configuration trains end-to-end (SFT → reward → PPO → eval) in a few GPU-hours on a single RTX A6000.
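Distinct-n in the second bullet is the standard diversity metric: the fraction of n-grams in the generated text that are unique. A minimal sketch, assuming whitespace tokenization:

```python
def distinct_n(tokens, n):
    # Unique n-grams divided by total n-grams; higher = more diverse output.
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)

toks = "the cat sat on the mat".split()
distinct_n(toks, 1)  # 5 unique unigrams / 6 total -> 0.833...
distinct_n(toks, 2)  # all 5 bigrams unique -> 1.0
```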
## Comparison vs. published SOTA instruct-tuned SLMs
Each instruct baseline is scored with the same SLM-RL-Agent reward model per dataset. Lower perplexity (PPL) is better; higher reward (R) is better. TS = TinyStories, CNN = CNN/DailyMail, Wiki = Wikitext-103.
| Class | Model | Training regime | TS PPL | TS R | CNN PPL | CNN R | Wiki PPL | Wiki R |
|---|---|---|---|---|---|---|---|---|
| 135M | SmolLM2-135M-Instruct | instruct-tune 1.7T tok | 8.5 | -0.52 | 19.79 | +0.34 | 34.30 | -0.79 |
| 135M | SmolLM2-135M (ours, SFT) | LoRA, 5 ep, 10K ex | 7.0 | -0.92 | 18.80 | +0.22 | 24.40 | -0.44 |
| 135M | SmolLM2-135M (ours, PPO) | + 250-step PPO RLHF | 7.4 | -0.69 | 19.20 | +0.03 | 25.10 | -0.42 |
| 360M+ | SmolLM2-360M-Instruct | instruct-tune 1.7T tok | 6.5 | +1.35 | 14.65 | +3.08 | 24.32 | +2.58 |
| 360M+ | Qwen2.5-0.5B-Instruct | instruct-tune 18T tok | 7.2 | +1.32 | 19.85 | +2.58 | 25.77 | +1.83 |
| 360M+ | SmolLM2-360M (ours, SFT) | LoRA, 5 ep, 10K ex | 5.3 | +1.69 | 12.71 | +2.36 | 16.66 | +2.71 |
| 360M+ | SmolLM2-360M (ours, PPO) | + 250-step PPO RLHF | 5.3 | +2.41 | 12.82 | +2.36 | 16.92 | +2.98 |
**Highlights.**
- Our 360M-class SFT beats every instruct baseline on perplexity across every dataset; the largest margin is on Wikitext-103 (16.7 vs. 24.3, a 31% reduction) at a single-GPU, domain-specific training budget.
- At the 360M class, our PPO checkpoint is the best on TinyStories reward (+2.41 vs. +1.35 for SmolLM2-360M-Instruct and +1.32 for Qwen2.5-0.5B-Instruct) and best on Wikitext-103 reward (+2.98 vs. +2.58 and +1.83).
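For reference, the PPL columns in both tables are the exponential of the mean per-token negative log-likelihood on the held-out split; a minimal sketch, assuming per-token NLLs are already computed:

```python
import math

def perplexity(token_nlls):
    # PPL = exp(mean negative log-likelihood per token); lower is better.
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model exactly uniform over 4 candidate tokens has NLL ln(4) everywhere:
perplexity([math.log(4.0)] * 10)  # -> 4.0
```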
## Usage
### Load an SFT LoRA adapter
```python
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Pick one of the 15 (model, dataset) combinations
model_key, dataset = "smollm2-360m", "wikitext"

adapter_dir = snapshot_download(
    repo_id="mr3haque/SLM-RL-Agent",
    allow_patterns=f"sft/{model_key}/{dataset}/**",
)
adapter_path = f"{adapter_dir}/sft/{model_key}/{dataset}"

base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M")
tok = AutoTokenizer.from_pretrained(adapter_path)
model = PeftModel.from_pretrained(base, adapter_path).merge_and_unload()
```
### Load a PPO model (already merged)
```python
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

model_key, dataset = "smollm2-360m", "wikitext"

ppo_dir = snapshot_download(
    repo_id="mr3haque/SLM-RL-Agent",
    allow_patterns=f"ppo/{model_key}/{dataset}/**",
)
ppo_path = f"{ppo_dir}/ppo/{model_key}/{dataset}"

tok = AutoTokenizer.from_pretrained(ppo_path)
model = AutoModelForCausalLM.from_pretrained(ppo_path)
```
## Training recipe (identical for all 15 configurations)
| Stage | Library | Key hyperparameters |
|---|---|---|
| SFT | HuggingFace Trainer + PEFT | LoRA r=16, α=32, 3 epochs, bs 8×4, LR 2e-4, bf16 |
| Reward model | HuggingFace Trainer | Bradley-Terry pairwise loss, 1 epoch, LR 1e-5 |
| PPO | TRL 0.9.x | 250 steps, LR 5e-6, KL 0.05-0.2, score clip ±3σ, float32, weight rollback |
Three engineering fixes unique to the SLM regime, all implemented in `scripts/train_ppo.py`:
- **Merge-and-reinitialize for PEFT+PPO.** TRL ≤ 0.9.x silently freezes LoRA parameters when the policy is a PEFT adapter. We merge the SFT adapter into the base weights, then attach a fresh LoRA on top before PPO.
- **Float32 throughout.** Bfloat16 causes probability-ratio explosions (> 10⁶) within the first PPO batch for models < 200M parameters. Float32 is required.
- **Reward whitening + weight rollback.** Score-clipping at ±3σ and a per-step weight-rollback mechanism that reverts to the last healthy snapshot on NaN/Inf eliminate catastrophic collapse across all 15 runs.
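The whitening/clipping/rollback logic in the last fix can be sketched in a few lines (an illustrative stdlib sketch, not the exact code in `scripts/train_ppo.py`):

```python
import math
import statistics

def whiten_and_clip(scores, clip_sigma=3.0):
    # Normalize rewards to zero mean / unit variance, then clip outliers
    # at +/- clip_sigma so a single extreme score cannot dominate a batch.
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores) or 1e-8
    return [max(-clip_sigma, min(clip_sigma, (s - mean) / std)) for s in scores]

def guard_update(new_params, snapshot):
    # Weight rollback: accept the update only if every weight is finite,
    # otherwise revert to the last healthy snapshot.
    if all(math.isfinite(p) for p in new_params):
        return new_params
    return snapshot

clipped = whiten_and_clip([0.0] * 20 + [100.0])       # extreme outlier capped at 3.0
safe = guard_update([1.0, float("nan")], [1.0, 2.0])  # -> [1.0, 2.0] (rolled back)
```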
## Citation
```bibtex
@misc{haque2026slmrlagent,
  title        = {Efficiently Enhancing SLM Agents: A Reinforcement Learning Approach to Performance Improvement},
  author       = {Haque, Md. Rezwanul},
  year         = {2026},
  howpublished = {\url{https://github.com/rezwanh001/slm-rl-agent}},
  note         = {University of Waterloo, CPAMI Lab}
}
```