HF CMU Collab

community

Activity Feed

AI & ML interests

None defined yet.

qgallouedec

posted an update 3 days ago

Post

1617

TRL v1.2 introduces the SSDTrainer 🚀

Simple Self-Distillation (SSD) from Apple's paper "Embarrassingly Simple Self-Distillation Improves Code Generation" is now available as an experimental trainer in TRL.

The recipe is as minimal as the name suggests: sample completions from the model itself at a training-time temperature, then fine-tune on those raw, unverified samples with plain cross-entropy. No reward model. No verifier. No teacher model. No reinforcement learning. Just prompts and the model.

from trl.experimental.ssd import SSDConfig, SSDTrainer

trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    args=SSDConfig(temperature=0.6, top_k=20, top_p=0.95),
    train_dataset=dataset,
)
trainer.train()

v1.2 also ships expanded tool-calling support (LLaMA 3.1 / 3.2, DeepSeek-V3), another round of KTO ↔ DPO alignment getting us closer to promoting KTO to stable, a big GRPO simplification for overlong tool results, deprecation of use_transformers_paged, and key fixes for VLM response parsing.

Full release notes: https://github.com/huggingface/trl/releases/tag/v1.2.0

qgallouedec

posted an update 19 days ago

Post

2308

TRL v1.0 is out!

Hugging Face's TRL library is downloaded 3 million times a month. Over 130k models trained with it are public on the Hub, and major projects like @unsloth and @axolotl-ai-co build directly on top of it. v1.0 is the moment we acknowledged that responsibility explicitly, with a real stability contract.

The field hasn't settled. Building stable software in a domain that keeps invalidating its own assumptions is the actual problem we're solving. The answer is a design that can absorb the next shift without breaking what people rely on.

What's in v1.0:
Deep Hugging Face integration, low infrastructure burden
What's next: asynchronous GRPO, better scaling support, and making training legible enough that agents can inspect and steer it.

pip install --upgrade trl

Single-minus gluon tree amplitudes are nonzero

Paper • 2602.12176 • Published Feb 12 • 8

Reasoning Cache: Continual Improvement Over Long Horizons via Short-Horizon RL

Paper • 2602.03773 • Published Feb 3 • 13

qgallouedec

submitted a paper to Daily Papers 4 months ago

INTELLECT-3: Technical Report

Paper • 2512.16144 • Published Dec 18, 2025 • 20

lewtun

authored a paper 11 months ago

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Paper • 2504.11354 • Published Apr 15, 2025 • 6

lewtun

authored a paper about 1 year ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7, 2025 • 207

edbeeching

authored a paper about 1 year ago

Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

Paper • 2503.07572 • Published Mar 10, 2025 • 48

lewtun

authored a paper about 1 year ago

Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

Paper • 2503.07572 • Published Mar 10, 2025 • 48

CohenQu

authored 3 papers about 1 year ago

Recursive Introspection: Teaching Language Model Agents How to Self-Improve

Paper • 2407.18219 • Published Jul 25, 2024 • 3

Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning

Paper • 2310.18247 • Published Oct 27, 2023

Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning

Paper • 2503.07572 • Published Mar 10, 2025 • 48

lewtun

posted an update about 1 year ago

Post

4635

Introducing OlympicCoder: a series of open reasoning models that can solve olympiad-level programming problems 🧑‍💻

- 7B open-r1/OlympicCoder-7B
- 32B open-r1/OlympicCoder-32B

We find that OlympicCoder models outperform Claude 3.7 Sonnet, as well as others over 100x larger 💪

Together with the models, we are releasing:

📊CodeForces-CoTs: new dataset of code problems from the most popular competitive coding platform, with R1 traces in C++ and Python open-r1/codeforces-cots

🏆 IOI'2024: a new benchmark of VERY hard programming problems where even frontier models struggle to match human performance open-r1/ioi

For links to the models and datasets, check out our latest progress report from Open R1: https://huggingface.co/blog/open-r1/update-3

1 reply

aviralku

authored 6 papers about 1 year ago

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

Paper • 2310.10639 • Published Oct 16, 2023 • 3

Vision-Language Models Provide Promptable Representations for Reinforcement Learning

Paper • 2402.02651 • Published Feb 5, 2024

RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold

Paper • 2406.14532 • Published Jun 20, 2024

AI & ML interests

Team members 7

hf-cmu-collab's activity