🎥 For a video walkthrough, check out "Let LLMs Wander - Engineering RL Environments": https://www.youtube.com/watch?v=71V3fTaUp2Q
Stefano Fiorucci PRO
anakin87
AI & ML interests
Language Models: orchestration, post-training, GRPO, synthetic data...
Contributing to the Haystack LLM framework
Recent Activity
liked a Space 1 minute ago: HuggingFaceTB/trl-distillation-trainer
replied to their post 1 day ago
reacted to their post 1 day ago
posted an update 1 day ago
Let LLMs wander - Engineering RL Environments
Reinforcement Learning Environments are little worlds where models can act, get rewards, and learn.
I've been exploring how to design them, figuring out what works and what doesn't.
If you want to learn how to build them, I recorded a practical intro video.
You'll also see how to turn Liquid AI's LFM2-2.6B into a Tic-tac-toe master.
🎥 Engineering RL Environments video: https://www.youtube.com/watch?v=71V3fTaUp2Q
---
🌱 LLM RL Environments Lil Course: https://github.com/anakin87/llm-rl-environments-lil-course
🤖🕹️ Play against the trained model: anakin87/LFM2-2.6B-mr-tictactoe
HF collection (datasets + models): https://huggingface.co/collections/anakin87/lfm2-26b-mr-tic-tac-toe
posted an update 4 days ago
📣 I just published a free course on Reinforcement Learning Environments for Language Models!
COURSE: https://github.com/anakin87/llm-rl-environments-lil-course
Over the past year, we've seen a shift in LLM post-training.
Previously, Supervised Fine-Tuning was the most important part: making models imitate curated question-answer pairs.
Now we also have Reinforcement Learning with Verifiable Rewards. With techniques like GRPO, models can learn through trial and error in dynamic environments. They can climb to new heights without relying on expensively prepared data.
But what actually are these environments in practice? And how do you build them effectively?
Fascinated by these concepts, I spent time exploring this space through experiments, post-training Small Language Models.
I've packaged everything I learned into this short course.
What you'll learn
🔹 Agents, Environments, and LLMs: how to map Reinforcement Learning concepts to the LLM domain
🔹 How to use Verifiers (an open-source library by Prime Intellect) to build RL environments as software artifacts (a minimal sketch follows this post)
🔹 Common patterns: how to build single-turn, multi-turn, and tool-use environments
🔹 Hands-on: turn a small language model (LFM2-2.6B by Liquid AI) into a Tic-tac-toe master
🔸 Build the game environment
🔸 Use it to generate synthetic data for SFT warm-up
🔸 Group-based Reinforcement Learning
If you're interested in building "little worlds" where LLMs can learn, this course is for you.
---
🤖🕹️ Play against the trained model: anakin87/LFM2-2.6B-mr-tictactoe
HF collection (datasets + models): https://huggingface.co/collections/anakin87/lfm2-26b-mr-tic-tac-toe
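To make the "software artifact" idea concrete, here is a minimal single-turn environment sketch in the spirit of the course. It assumes the Verifiers API (vf.Parser, vf.Rubric, vf.SingleTurnEnv); the dataset and reward function are illustrative, and exact signatures may differ across library versions.

```python
# Hedged sketch of a single-turn RL environment with Verifiers.
import verifiers as vf
from datasets import Dataset

dataset = Dataset.from_list([
    {"question": "Sort alphabetically: pear, apple, fig", "answer": "apple, fig, pear"},
])

parser = vf.Parser()  # extracts the final answer text from the completion

def exact_match(completion, answer, **kwargs) -> float:
    # outcome-based, verifiable reward: 1.0 only if the parsed answer matches
    parsed = parser.parse_answer(completion) or ""
    return 1.0 if parsed.strip().lower() == answer.lower() else 0.0

rubric = vf.Rubric(funcs=[exact_match], weights=[1.0])
env = vf.SingleTurnEnv(dataset=dataset, parser=parser, rubric=rubric)
```

A multi-turn game like Tic-tac-toe would subclass a multi-turn environment instead, with the board as state and the win/loss/draw reward applied at the end of the episode.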
reacted to RakshitAralimatti's post 4 days ago
🔥 GLM-5.1 (zai-org/GLM-5.1): quietly one of the best flagship models for agentic engineering and coding tasks right now.
I threw some LangGraph agent code at it, a messy RAG pipeline, some async Python stuff, and it just handled it. No drama, no hallucinated methods, actually usable output on the first try.
Open source closing the gap this fast is genuinely exciting. Go check zai-org/GLM-5.1 on HF if you haven't already.
Good work @zai-org-3
posted an update 4 months ago
💭 Do thinking traces make Language Models learn better? Curious what others think.
Scenario
You take an instruction-following LM.
You want to train it with a GRPO-style RL algorithm on a task like Tic-tac-toe.
Rewards are outcome-based, applied only at the end of each episode: win/loss/draw, format adherence...
During training, the model could just output answers, but a common choice is to make it also output thinking traces.
The question
Does forcing the model to produce thinking traces during training actually improve learning?
💬 I'd like to hear your thoughts. Share ideas and links to relevant papers and resources.
From what I've understood so far, the answer seems to be yes.
1️⃣ If you force the model to think during training, it becomes a model that thinks at inference time. It naturally allocates more budget (tokens) to a problem, which tends to improve performance.
2️⃣ While the model's "reasoning" already exists in its activation space, using explicit thinking traces as a scratchpad allows training to steer and shape that reasoning.
3️⃣ As the model produces more traces during training, the RL algorithm can progressively give higher rewards to the reasoning patterns that lead to better outcomes.
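To make the "format adherence" part concrete, here is a hedged sketch of a reward that pays the model only when it emits a thinking trace before the answer. The signature follows the reward-function convention of TRL's GRPOTrainer (a list of completions in, a list of floats out); the tag format is an assumption.

```python
import re

# completion must contain a non-empty <think>...</think> block followed by an answer
THINK_RE = re.compile(r"<think>.+?</think>\s*\S", re.DOTALL)

def thinking_format_reward(completions, **kwargs):
    """Return 1.0 for each completion with a thinking trace plus an answer, else 0.0."""
    return [1.0 if THINK_RE.search(c) else 0.0 for c in completions]
```

In a GRPO-style setup this would be combined with the outcome reward (win/loss/draw), so the trace format is enforced while the game result still dominates learning.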
posted an update 5 months ago
I made a visualization based on the Prime Intellect INTELLECT-3 technical report.
Wild to see how far they pushed GLM-4.5-Air-Base with SFT + RL.
SOTA for its size and competitive with models 3x larger.
All open.
Congrats on the release!
Model: PrimeIntellect/INTELLECT-3
Technical report: https://storage.googleapis.com/intellect-3-paper/INTELLECT_3_Technical_Report.pdf
Chat: https://chat.primeintellect.ai/
posted an update 5 months ago
LLMs can leak their post-training data (RL included) 🧐
New interesting paper on this topic from Google DeepMind: Extracting alignment data in open models (2510.18554)
It's known that Language Models memorize data that can be extracted via prompting.
In this paper, the authors investigate this aspect:
- using open models, where prompting can be fully customized by the user, including special tokens;
- focusing on open-source models like Olmo, where the full training data is available.
🤔 How do they extract data?
During post-training (like SFT), new tokens such as <|user|> are introduced.
The authors hypothesize that prompting the model with these tokens can make it output its alignment data (remember Magpie?).
For example, for SFT, their extraction prompt is <|endoftext|><|user|>.
Evaluating memorization
The authors compare each sampled example with the original data using vector search with embedding similarity.
They find that many outputs are semantically very similar to the original data, even if the exact words differ.
Traditional string-matching algorithms underestimate memorization by 10x.
What about RL?
Surprisingly, the same technique works to extract data from Reinforcement Learning (PPO/GRPO) phases.
This is counter-intuitive because the RL objective is not designed to increase sequence likelihoods (unlike SFT).
Practical limitation: in this case, extraction relies on using the initial part of the training prompt, which is not generally public.
Are the extracted data effective for post-training?
Both in SFT and RL, the extracted data can be used to fine-tune models to performance similar to the originals.
The authors suggest that model distillation, where a stronger model is used to drive the training of a weaker one, may be a form of indirect training on the original dataset.
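A hedged sketch of the two steps described above, using standard libraries: sample from an open model prompted only with post-training special tokens, then score the samples against candidate training examples by embedding similarity. The model choice and the reference example are illustrative assumptions, not the paper's exact setup.

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# 1) sample with the SFT extraction prompt (special tokens only, no user text)
generator = pipeline("text-generation", model="allenai/OLMo-2-1124-7B-Instruct")
samples = generator("<|endoftext|><|user|>", max_new_tokens=128,
                    do_sample=True, num_return_sequences=4)

# 2) compare to training data semantically instead of by string matching
embedder = SentenceTransformer("all-MiniLM-L6-v2")
generated = embedder.encode([s["generated_text"] for s in samples])
reference = embedder.encode(["a candidate example from the alignment dataset"])  # hypothetical
print(util.cos_sim(generated, reference))  # high similarity -> likely memorized
```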
posted an update 7 months ago
Your Language Model needs better (open) environments to learn
https://huggingface.co/blog/anakin87/environments-hub
RL environments help LLMs practice, reason, and improve.
I explored the Environments Hub and wrote a walkthrough showing how to train and evaluate models using these open environments.
1️⃣ Why RL matters for LLMs
DeepSeek-R1 made clear that Reinforcement Learning can be used to incentivize reasoning in LLMs.
In GRPO, the model generates multiple answers and learns to prefer the better ones based on rewards (see the sketch after this post).
2️⃣ What environments are
In classic RL, the environment is the world where the Agent lives, interacts, and gets rewards to learn.
We can also think of them as software packages, containing data, a harness, and scoring rules, for the model to learn from and be evaluated on.
Nowadays, the Agent is not just the LLM. It can use tools, from a weather API to a terminal.
This makes environments for training and evaluation more complex and critical.
3️⃣ The open challenge
Big labs are advancing, but open models and the community still face a fragmented ecosystem.
We risk becoming users of systems built with tools we can't access or fully understand.
4️⃣ Environments Hub
That's why I was excited when Prime Intellect released the Environments Hub.
It's a place where people share RL environments: tasks you can use to train LLMs with RL (GRPO-style) or evaluate Agents.
Plus, the Verifiers library (@willcb) standardizes the creation of RL environments and evaluations.
They can help keep science and experimentation open. 🔬
I explored the Hub and wrote a hands-on walkthrough:
- RL + LLMs basics
- Environments Hub navigation
- Evaluating models/Agents
- GRPO-training a tiny model on an alphabetical sort task
Take a look!
https://huggingface.co/blog/anakin87/environments-hub
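The GRPO idea in point 1️⃣ boils down to group-relative advantages: sample a group of answers per prompt, score each with the environment's reward, and standardize rewards within the group so better-than-average answers get reinforced. A minimal sketch of that computation (illustrative, not any library's exact implementation):

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Standardize rewards within one group of sampled answers (GRPO-style)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against identical rewards
    return [(r - mean) / std for r in rewards]

# four sampled answers to the same prompt, scored by the environment
print(group_advantages([1.0, 0.0, 0.5, 0.0]))  # the best answer gets the largest positive advantage
```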
reacted to sergiopaniego's post with 🔥 7 months ago
You can now supercharge your TRL training pipelines with kernels.
kernels is a new library to load optimized compute kernels directly from the Hub.
Combined with TRL, it makes your developer experience smoother and faster.
Check out the new guide to learn more!
Learn ➡️ https://huggingface.co/docs/trl/main/en/kernels_hub
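For a feel of the library, here is a short sketch adapted from the kernels README: load a community kernel from the Hub and call one of its ops. It assumes a CUDA GPU and the kernels-community/activation repo; the op name and its output-first signature come from that example and may change.

```python
import torch
from kernels import get_kernel

# downloads the optimized kernel from the Hub and loads it
activation = get_kernel("kernels-community/activation")

x = torch.randn(8, 16, device="cuda", dtype=torch.float16)
y = torch.empty_like(x)
activation.gelu_fast(y, x)  # writes the activated values into y
print(y)
```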
posted an update 8 months ago
Want to quickly try Gemma 3 270m?
I made a simple Space to do that: anakin87/gemma-3-270m-it
⚡ Fast: Flash Attention, Zero GPU
⚙️ Configurable
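If you'd rather try it locally than in the Space, a minimal transformers sketch (assuming a recent transformers version with Gemma 3 support):

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="google/gemma-3-270m-it", device_map="auto")
messages = [{"role": "user", "content": "Write a haiku about small language models."}]
out = pipe(messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```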
posted an update 8 months ago
🕵️ Building Browser Agents - notebook
No API? No problem.
Browser Agents can use websites like you do: click, type, wait, read.
Step-by-step notebook: https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/browser_agents.ipynb
🎥 In the video, the Agent:
- Goes to Hugging Face Spaces
- Finds black-forest-labs/FLUX.1-schnell
- Expands a short prompt ("my holiday on Lake Como") into a detailed image generation prompt
- Waits for the image
- Returns the image URL
## What else can it do?
Great for information gathering and summarization:
- Compare news websites and create a table of shared stories with links
- Find content creators' social profiles from YouTube videos
- Find a product's price range on Amazon
- Gather public transportation travel options
## How is it built?
- Haystack → Agent execution logic
- Google Gemini 2.5 Flash → good, fast LLM with a generous free tier
- Playwright MCP server → browser automation tools: navigate, click, type, wait...
Even without vision capabilities, this setup can get quite far (a code sketch of the stack follows at the end of this post).
## Next steps
- Try a local open model
- Move from notebook to real deployment
- Incorporate vision
And you? Have you built something similar? What's in your stack?
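Here is the promised sketch of the stack: Haystack's Agent, Gemini as the chat generator, and the Playwright MCP server exposed as tools. The import paths for the Google GenAI and MCP integrations are assumptions based on the Haystack integrations naming scheme; check the notebook for the exact ones.

```python
from haystack.components.agents import Agent
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.google_genai import GoogleGenAIChatGenerator
from haystack_integrations.tools.mcp import MCPToolset, StdioServerInfo

# expose the Playwright MCP server's browser tools (navigate, click, type, ...)
tools = MCPToolset(server_info=StdioServerInfo(command="npx", args=["@playwright/mcp@latest"]))

agent = Agent(
    chat_generator=GoogleGenAIChatGenerator(model="gemini-2.5-flash"),
    tools=tools,
    system_prompt="You are a browser agent. Use the browser tools to complete the user's task.",
)

result = agent.run(messages=[ChatMessage.from_user("Find the top trending Space on Hugging Face")])
print(result["messages"][-1].text)
```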
reacted to mlabonne's post with 🔥 8 months ago
Liquid just released two VLMs: 450M and 1.6B params!
They're super fast and leverage SigLIP2 NaFlex encoders to handle native resolutions without distortion. They're ideal for on-device deployment in constrained environments like phones.
Both are available today on Hugging Face, with inference and fine-tuning Colab notebooks.
LiquidAI/LFM2-VL-450M
LiquidAI/LFM2-VL-1.6B
posted an update 8 months ago
Haystack can now see 👀
The latest release of the Haystack OSS LLM framework adds a long-requested feature: image support!
Notebooks below.
This isn't just about passing images to an LLM. We built several features to enable practical multimodal use cases.
What's new?
- Support for multiple LLM providers: OpenAI, Amazon Bedrock, Google Gemini, Mistral, NVIDIA, OpenRouter, Ollama and more (support for the Hugging Face API coming soon)
- A prompt template language to handle structured inputs, including images
- PDF and image converters
- Image embedders using CLIP-like models
- An LLM-based extractor to pull text from images
- Components to build multimodal RAG pipelines and Agents
I had the chance to lead this effort with @sjrhuschlee (great collab).
Below you can find two notebooks to explore the new features:
- Introduction to Multimodal Text Generation: https://haystack.deepset.ai/cookbook/multimodal_intro
- Creating Vision+Text RAG Pipelines: https://haystack.deepset.ai/tutorials/46_multimodal_rag
(🖼️ image by @bilgeyucel)
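A minimal sketch of the new primitives, sending text plus an image to a chat generator. ImageContent and content_parts follow the Haystack 2.x docs; the file path and model are placeholders.

```python
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage, ImageContent

image = ImageContent.from_file_path("invoice.png")  # hypothetical local file
message = ChatMessage.from_user(content_parts=["Summarize this document.", image])

generator = OpenAIChatGenerator(model="gpt-4o-mini")
print(generator.run(messages=[message])["replies"][0].text)
```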
posted an update 9 months ago
🛡️ AI Guardrails with Open Language Models - Tutorial
https://haystack.deepset.ai/cookbook/safety_moderation_open_lms
How do you ensure your AI application is safe from harmful or inappropriate user inputs?
This is a core requirement for real-world AI deployments. Luckily, several open Language Models are built specifically for safety moderation.
I've been exploring them and put together a hands-on tutorial using the Haystack framework to build your own AI guardrails.
In the notebook, you'll learn how to use and customize:
🔹 Meta Llama Guard (via the Hugging Face API)
🔹 IBM Granite Guardian (via Ollama), which can also evaluate RAG-specific risk dimensions
🔹 Google ShieldGemma (via Ollama)
🔹 the NVIDIA NemoGuard model family, including a model for topic control
You'll also see how to integrate content moderation into a RAG pipeline.
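As a taste of the approach, a hedged sketch of an input guardrail with Llama Guard via transformers: the moderation model classifies the user message before it ever reaches your main LLM. The model id and the "safe"/"unsafe" plus hazard-category output format come from Meta's model cards; verify against the tutorial notebook.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-1B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

chat = [{"role": "user", "content": "Tell me how to pick a lock."}]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=16)

verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(verdict)  # e.g. "unsafe\nS2" -> block or reroute the request
```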
reacted to andito's post 9 months ago
🧠 Can AI visualize solutions?
Humans often solve visual problems by sketching ideas in our minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal "mental sketches"?
That's the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.
These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.
Mirage is trained in two phases:
1) Grounding: it learns to produce latent tokens anchored in real images.
2) Refinement: the model drops the images and learns to generate visual tokens on its own.
And yes, it works!
On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines.
Smart sketches > empty words.
By mimicking the way humans visualize solutions, Mirage gives AI a new kind of imagination, one that's faster, more efficient, and more human-like.
Kudos to the teams at UMass Amherst and MIT behind this exciting work.
Check the paper: Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (2506.17218)
posted an update 10 months ago
🧰 Free up space on the Hub with super_squash_history 🧹
As you may know, the Hugging Face Hub has storage limits on private repos (100 GB for free users, 1 TB for PROs).
This weekend I did some cleanup on my private repos.
I went from 1.58 TB down to 1 GB.
Besides deleting old, unused models, the main tool I used was a lesser-known command: super_squash_history.
When you train a model, you often push multiple checkpoints to the Hub.
Each checkpoint = a commit.
A 2.6B model in BF16 is ~5 GB.
So 10 checkpoints = 50 GB. That adds up fast.
While full commit history can be useful for rollbacks, it's often unnecessary for older experiments where only the final model matters.
In these cases, you can use super_squash_history: it reduces your entire repo history to a single commit.
https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.super_squash_history
⚠️ super_squash_history is a non-revertible operation. Once squashed, the commit history cannot be retrieved.
Hope this is useful to others.
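For reference, the call is a one-liner (sketch with a placeholder repo_id; double-check it before running, since the operation cannot be undone):

```python
from huggingface_hub import HfApi

api = HfApi()
# collapses the full commit history of the repo into a single commit
api.super_squash_history(repo_id="your-username/old-experiment", repo_type="model")
```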
reacted to as-cle-bert's post with ❤️ 12 months ago
One of the biggest challenges I've been facing since I started developing [PdfItDown](https://github.com/AstraBert/PdfItDown) was handling the conversion of files like Excel sheets and CSVs correctly: table conversion was bad and messy, almost unusable for downstream tasks.
That's why today I'm excited to introduce readers, the new feature of PdfItDown v1.4.0!
With readers, you can choose among three (for now) flavors of text extraction and conversion to PDF:
- Docling, which does a fantastic job with presentations, spreadsheets and Word documents
- LlamaParse by LlamaIndex, suitable for more complex and articulated documents with a mixture of text, images and tables
- MarkItDown by Microsoft, not the best at handling highly structured documents, but extremely flexible in terms of input file format (it can even convert XML, JSON and ZIP files!)
You can use this new feature in your Python scripts (check the attached code snippet!) and in the command-line interface as well!
Have fun and don't forget to star the repo on GitHub ➡️ https://github.com/AstraBert/PdfItDown