We sincerely hope this tutorial serves as a valuable starting point for your journey.

## Introduction

<span id="figure1" style="position: absolute;"></span>

`lerobot` is the open-source library for end-to-end robotics developed by Hugging Face. The library is vertically integrated across the entire robotics stack, supporting low-level control of real-world robot devices, advanced data and inference optimizations, and SOTA robot learning methods with simple implementations in pure PyTorch.
The frontier of robotics research is indeed increasingly moving away from classical, dynamics-based approaches.

Moreover, since end-to-end learning on ever-growing collections of text and image data has historically been at the core of the development of *foundation models* capable of semantic reasoning across multiple modalities (images, text, audio, etc.), deriving robotics methods grounded in learning appears particularly consequential, especially as the number of openly available datasets continues to grow.

Robotics is, at its core, an inherently multidisciplinary field, requiring a wide range of expertise in both *software* and *hardware*. The integration of learning-based techniques further broadens this spectrum of skills, raising the bar for both research and practical applications. `lerobot` is an open-source library designed to integrate end-to-end with the entire robotics stack. With a strong focus on accessible, real-world robots, (1) `lerobot` *supports many openly available robotic platforms* for manipulation, locomotion, and even whole-body control. `lerobot` also implements (2) *a unified, low-level approach to reading/writing robot configurations*, extending support to other robot platforms with relatively low effort. The library introduces `LeRobotDataset`, (3) *a native robotics dataset format* currently used by the community to efficiently record and share datasets. `lerobot` also supports many state-of-the-art (SOTA) algorithms in robot learning--mainly based on Reinforcement Learning (RL) and Behavioral Cloning (BC) techniques--with efficient implementations in PyTorch, and extended support for experimentation and experiment tracking. Lastly, `lerobot` defines a custom, optimized inference stack for robotic policies that decouples action planning from action execution, proving effective in guaranteeing more adaptability at runtime.

This tutorial serves the dual purpose of providing useful references for both the science behind--and the practical use of--common robot learning techniques. To this aim, we strive to provide a rigorous yet concise overview of the core concepts behind the techniques presented, paired with practical examples of how to use them concretely, with code examples in `lerobot`, for researchers and practitioners interested in the field of robot learning.
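To make the dataset idea concrete, here is a *hypothetical* sketch of the episodic, frame-indexed structure a format like `LeRobotDataset` organizes. This mirrors the concept only--the class, method names, and the `observation.state`/`action` keys (which echo `lerobot` naming conventions) are illustrative, not the actual `lerobot` API:

```python
# Hypothetical sketch of an episodic, frame-indexed robotics dataset.
# NOT the real LeRobotDataset API: names and keys are illustrative only.
from dataclasses import dataclass, field

@dataclass
class EpisodicDataset:
    fps: int = 30
    episodes: list = field(default_factory=list)  # each episode is a list of frames

    def add_episode(self, frames):
        self.episodes.append(frames)

    @property
    def num_frames(self):
        # total number of recorded frames across all episodes
        return sum(len(ep) for ep in self.episodes)

    def frame(self, episode_idx, t):
        """Return the frame at timestep t of a given episode."""
        return self.episodes[episode_idx][t]

ds = EpisodicDataset(fps=30)
ds.add_episode([{"observation.state": [0.0, 0.1], "action": [0.2]} for _ in range(50)])
ds.add_episode([{"observation.state": [0.3, 0.4], "action": [0.5]} for _ in range(30)])
print(ds.num_frames)  # 80
```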
This tutorial is structured as follows:

TL;DR: Learning-based approaches to robotics are motivated by the need to (1) generalize...

### Explicit and Implicit Models

<span id="generating-motion-atlas" style="position: absolute;"></span>

Overview of methods to generate motion (clearly non-exhaustive; see @bekrisStateRobotMotion2024). The different methods can be grouped based on whether they explicitly (dynamics-based) or implicitly (learning-based) model robot-environment interactions.
Robotics is concerned with producing artificial motion in the physical world in a useful, reliable, and safe fashion. Thus, robotics is an inherently multidisciplinary domain: producing autonomous motion in the physical world requires, at the very least, interfacing different software (motion planners) and hardware (motion executors) components. Knowledge of mechanical, electrical, and software engineering, as well as rigid-body mechanics and control theory, has therefore proven quintessential in robotics since the field first developed in the 1950s. More recently, Machine Learning (ML) has also proved effective in robotics, complementing these more traditional disciplines @connellRobotLearning1993. As a direct consequence of its multidisciplinary nature, robotics has developed as a rather wide array of methods, all concerned with the main purpose of *producing artificial motion in the physical world*.
Methods to produce robotic motion range from traditional *explicit* models--*dynamics-based*[^1] methods, leveraging precise descriptions of the mechanics of robots' rigid bodies and their interactions with any obstacles in the environment--to *implicit* models--*learning-based* methods, treating artificial motion as a statistical pattern to learn from multiple sensorimotor readings @agrawalComputationalSensorimotorLearning, @bekrisStateRobotMotion2024. A variety of methods have been developed between these two extrema. For instance, @hansenTemporalDifferenceLearning2022 show how learning-based systems can benefit from information on the physics of problems, complementing a traditional learning method such as Temporal Difference (TD) learning @suttonReinforcementLearningIntroduction2018 with Model-Predictive Control (MPC). Conversely, as explicit models may rely on assumptions that prove overly simplistic--or even unrealistic--in practice, learning can prove effective in improving the modeling of complex phenomena or complementing perception @mccormacSemanticFusionDense3D2016. These examples demonstrate the richness of approaches to robotics, and Figure [generating-motion-atlas] graphically illustrates some of the most relevant techniques. Such a list is clearly far from exhaustive, and we refer to @bekrisStateRobotMotion2024 for a more comprehensive overview of both general and application-specific methods for motion generation. In this section, we wish to introduce the inherent benefits of *learning-based approaches to robotics*--the core focus of this tutorial.

### Different Types of Motion

<span id="robotics-platforms-atlas" style="position: absolute;"></span>

Different kinds of motion are achieved with potentially very different robotic platforms. From left to right, top to bottom: ViperX, SO-100, Boston Dynamics' Spot, Open-Duck, 1X's NEO, Boston Dynamics' Atlas. This list of robotic platforms is (very) far from exhaustive.
Robot manipulators typically consist of a series of links and joints, articulated...

Recently, the development of low-cost manipulators like the ALOHA @zhaoLearningFineGrainedBimanual2023, ALOHA-2 @aldacoALOHA2Enhanced, and SO-100/SO-101 @knightStandardOpenSO100 platforms has significantly lowered the barrier to entry to robotics, given the increased accessibility of these robots compared to more traditional platforms like the Franka Emika Panda arm (Figure [robotic-platforms-costs]).

<span id="robotic-platforms-costs" style="position: absolute;"></span>

Cheaper, more accessible robots are starting to rival traditional platforms like the Panda arm in adoption in resource-constrained scenarios. The SO-100, in particular, costs in the hundreds of Euros and can be entirely 3D-printed in hours, while the industrially manufactured Panda arm costs tens of thousands of Euros and is not openly available.
Deriving an intuition as to why learning-based approaches are gaining popularity in the robotics community requires briefly analyzing traditional approaches to manipulation, leveraging tools like forward and inverse kinematics (FK, IK) and control theory. Providing a detailed overview of these methods falls (well) outside the scope of this tutorial, and we refer the reader to works including @sicilianoSpringerHandbookRobotics2016, @lynchModernRoboticsMechanics2017, @tedrakeRoboticManipulationPerception, @tedrakeUnderactuatedRoboticsAlgorithms for a much more comprehensive description of these techniques. Here, we mostly wish to highlight the benefits of ML over these traditional techniques.

<span id="make-so100-planar-manipulator" style="position: absolute;"></span>

The SO-100 arm is a 6-dof manipulator. Preventing some of its joints (shoulder pan, wrist flex, and wrist roll) from actuating, it can be represented as a traditional 2-dof planar manipulator (the gripper joint in the end-effector does not count towards the degrees of freedom used to produce motion).
All these simplifying assumptions leave us with the planar manipulator of Figure [planar-manipulation-simple].

<span id="planar-manipulation-simple" style="position: absolute;"></span>

Free to move

<span id="planar-manipulator-floor" style="position: absolute;"></span>

Constrained by the surface

<span id="planar-manipulator-floor-shelf" style="position: absolute;"></span>

Constrained by surface and (fixed) obstacle
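Under the simplifications above, forward kinematics for the resulting 2-dof planar arm reduces to two lines of trigonometry. A minimal sketch follows; the link lengths are illustrative, not the SO-100's actual dimensions:

```python
# Forward kinematics of a 2-dof planar manipulator (illustrative link lengths).
import math

def fk_planar(theta1, theta2, l1=0.1, l2=0.1):
    """End-effector position (x, y) for joint angles theta1, theta2 (radians)."""
    x = l1 * math.cos(theta1) + l2 * math.cos(theta1 + theta2)
    y = l1 * math.sin(theta1) + l2 * math.sin(theta1 + theta2)
    return x, y

# A fully extended arm reaches l1 + l2 along the x-axis:
print(fk_planar(0.0, 0.0))  # (0.2, 0.0)
```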
While very effective when a goal trajectory has been well specified, the performance...
We point the interested reader to , , and  for extended coverage of FK, IK, and diff-IK.

Despite the last 60+ years of robotics research, autonomous robots are still largely incapable of performing tasks at human-level performance in the physical world while generalizing across (1) robot embodiments (different manipulators, different locomotion platforms, etc.) and (2) tasks (tying shoe-laces, manipulating a diverse set of objects). While essential to the early development of robotics, the aforementioned methods require significant human expertise to be used in practice, and are typically specific to a particular application problem.
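As a minimal illustration of the differential-IK machinery referenced above, the following sketch iterates a damped least-squares update on a 2-link planar arm; link lengths, gains, and the damping constant are all illustrative choices, not values from any particular platform:

```python
# Numerical inverse kinematics for a 2-link planar arm via damped least
# squares: theta <- theta + step * J^T (J J^T + lambda I)^{-1} e.
# All constants (link lengths, damping, step size) are illustrative.
import math

L1, L2 = 0.1, 0.1

def fk(t1, t2):
    return (L1 * math.cos(t1) + L2 * math.cos(t1 + t2),
            L1 * math.sin(t1) + L2 * math.sin(t1 + t2))

def jacobian(t1, t2):
    # 2x2 Jacobian d(x, y)/d(t1, t2)
    return [[-L1 * math.sin(t1) - L2 * math.sin(t1 + t2), -L2 * math.sin(t1 + t2)],
            [ L1 * math.cos(t1) + L2 * math.cos(t1 + t2),  L2 * math.cos(t1 + t2)]]

def ik(target, t1=0.3, t2=0.3, iters=200, damping=1e-3, step=0.5):
    for _ in range(iters):
        x, y = fk(t1, t2)
        ex, ey = target[0] - x, target[1] - y        # task-space error
        J = jacobian(t1, t2)
        # Solve (J J^T + damping * I) w = e in closed form for the 2x2 case
        a = J[0][0] ** 2 + J[0][1] ** 2 + damping
        b = J[0][0] * J[1][0] + J[0][1] * J[1][1]
        d = J[1][0] ** 2 + J[1][1] ** 2 + damping
        det = a * d - b * b
        wx = ( d * ex - b * ey) / det
        wy = (-b * ex + a * ey) / det
        # dq = J^T w
        t1 += step * (J[0][0] * wx + J[1][0] * wy)
        t2 += step * (J[0][1] * wx + J[1][1] * wy)
    return t1, t2
```

Calling `ik((0.15, 0.05))` drives the end-effector to a reachable target inside the arm's workspace; the damping keeps the update well behaved near singular (fully extended) configurations.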

<span id="classical-limitations" style="position: absolute;"></span>

Dynamics-based approaches to robotics suffer from several limitations: (1) orchestrating multiple components poses integration challenges; (2) the need to develop custom processing pipelines for the sensing modalities and tasks considered hinders scalability; (3) simplified analytical models of physical phenomena (here friction at the gripper; credits to @antonovaReinforcementLearningPivoting2017) limit real-world performance. Lastly, (4) dynamics-based methods overlook trends in the availability and growth of robotics data.
Dynamics-based robotics pipelines have historically been *developed sequentially, engineering the different blocks* now found within most architectures for specific purposes. That is, sensing, state estimation, mapping, planning, (diff-)IK, and low-level control have traditionally been developed as distinct modules with fixed interfaces. Pipelining these specific modules proved error-prone, and brittleness emerges--alongside compounding errors--whenever changes occur (e.g., changes in lighting for sensing, occlusion/failure of sensors, control failures). Adapting such a stack to new tasks or robotic platforms often entails re-specifying objectives, constraints, and heuristics at multiple stages, incurring significant engineering overhead.

Moreover, classical planners operate on compact, assumed-sufficient state representations; extending them to reason directly over raw, heterogeneous, and noisy data streams is non-trivial. This results in a *limited scalability to multimodal data and multitask settings*, as incorporating high-dimensional perceptual inputs (RGB, depth, tactile, audio) traditionally required extensive engineering efforts to extract meaningful features for control. Also, the large number of tasks, coupled with the adoption of *per-task* planners, goal parameterizations, and safety constraints, results in an explosion of design and validation options, with little opportunity to reuse solutions across tasks.

Setting aside integration and scalability challenges: accurately modeling contact, friction, and compliance for complicated systems remains difficult. Rigid-body approximations are often insufficient in the presence of deformable objects, and *relying on approximate models hinders the real-world applicability* of the methods developed. In the case of complex, time-dependent, and/or non-linear dynamics, even moderate mismatches in parameters, unmodeled evolutions, or grasp-induced couplings can qualitatively affect the observed dynamics.

Lastly, dynamics-based methods (naturally) overlook the rather recent *increase in the availability of open robotics datasets*. The curation of academic datasets by large, centralized groups of human experts in robotics @oneillOpenXEmbodimentRobotic2025, @khazatskyDROIDLargeScaleInTheWild2025 is now increasingly complemented by a *growing number of robotics datasets contributed in a decentralized fashion* by individuals with varied expertise. Dynamics-based approaches are, at best tangentially, poised to benefit from this trend, which holds the promise of enabling generalization across tasks and embodiments, just as data was the cornerstone of advancements in vision @alayracFlamingoVisualLanguage2022 and natural-language understanding @brownLanguageModelsAre2020.

Taken together, these limitations (Figure [classical-limitations]) motivate the exploration of learning-based approaches that can (1) integrate perception and control more tightly, (2) adapt across tasks and embodiments with reduced expert modeling interventions, and (3) scale gracefully in performance as more robotics data becomes available.

TL;DR: The need for expensive, high-fidelity simulators can be obviated by learning...

<span id="robot-learning-upsides" style="position: absolute;"></span>

Learning-based robotics streamlines perception-to-action by learning (1) a unified high-level controller capable of taking in (2) high-dimensional, unstructured sensorimotor information. Learning (3) does not require a dynamics model, focusing instead on interaction data, and (4) performance empirically correlates with the scale of the data used.
Learning-based techniques for robotics naturally address the limitations presented in Section [classical] (Figure [robot-learning-upsides]). In particular, learning-based techniques typically rely on monolithic perception-to-action pipelines (*visuomotor policies*) which directly map sensorimotor inputs to predicted actions, streamlining control by removing the need to interface multiple components. Mapping sensory inputs to actions also makes it possible to incorporate diverse input modalities, leveraging the automatic feature-extraction capabilities of modern learning systems. Moreover, learning-based approaches can, in principle, bypass explicit modeling altogether and instead rely solely on interaction data--an advantage that proves transformative when dynamics are difficult to model or entirely unknown. Lastly, learning for robotics (*robot learning*) is naturally well posed to leverage the growing amount of openly available robotics data--in great part overlooked by dynamics-based approaches--just as computer vision and natural language processing historically benefited from large-scale corpora of data.

Being a field in its relatively nascent stages, no prevalent technique proves distinctly better than any other in the domain of robot learning. Still, two major classes of methods have gained prominence: *Reinforcement Learning (RL)* and *Behavioral Cloning (BC)* (Figure [robot-learning-atlas]). In this section, we provide a conceptual overview of applications of RL to robotics, and introduce practical examples of how to use RL within `lerobot`. We then introduce the major limitations RL suffers from, before presenting BC techniques in Section [learning-imitation] and Section [sec-learning-foundation].
In Figure [robot-learning-atlas] we deliberately include generalist robot models @blackp0VisionLanguageActionFlow2024, @shukorSmolVLAVisionLanguageActionModel2025 alongside task-specific BC methods. While significantly different in spirit--*generalist* models are language-conditioned and use instructions to generate motion valid across many tasks, while *task-specific* models are typically not language-conditioned and are used to perform a single task--*foundation* models are still largely trained to reproduce trajectories contained in a (large) training set of input demonstrations. Thus, we argue generalist policies can indeed be grouped alongside other task-specific BC methods, as both leverage similar training data and schemas. Figure [robot-learning-atlas] illustrates this categorization graphically, explicitly listing all the robot learning policies currently available in `lerobot`: Action Chunking with Transformers (ACT) @zhaoLearningFineGrainedBimanual2023, Diffusion Policy @chiDiffusionPolicyVisuomotor2024, Vector-Quantized Behavior Transformer (VQ-BeT) @leeBehaviorGenerationLatent2024, $`\pi_0`$ @blackp0VisionLanguageActionFlow2024, SmolVLA @shukorSmolVLAVisionLanguageActionModel2025, Human-in-the-loop Sample-efficient RL (HIL-SERL) @luoPreciseDexterousRobotic2024, and TD-MPC @hansenTemporalDifferenceLearning2022.

<span id="robotics-with-rl-examples" style="position: absolute;"></span>

Examples of two different robotics tasks performed using RL. In the manipulation task (A), an agent learns to reach for a yellow plastic block in its environment and to put it inside a box. In the locomotion task (B), an agent learns to move its center of mass sideways without falling.
Applications of RL to robotics have been studied long enough that the relationship...

The RL framework @suttonReinforcementLearningIntroduction2018, which we briefly introduce here, has often been used to tackle robotics problems @koberReinforcementLearningRobotics. RL is a subfield of ML fundamentally concerned with the development of autonomous systems (*agents*) capable of *continuously behaving* in an evolving environment, developing (ideally, well-performing) control strategies (*policies*). Crucially for robotics, RL agents improve through trial and error, bypassing explicit models of the problem dynamics in favor of interaction data. In RL, this feedback loop between actions and outcomes (Figure [rl-most-famous-pic]) is established through the agent sensing a scalar quantity (*reward*) measuring how desirable a given *transition* is for the accomplishment of its goal.

<span id="rl-most-famous-pic" style="position: absolute;"></span>

Agent-Environment interaction diagram (image credits to @suttonReinforcementLearningIntroduction2018).
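The interaction diagram above can be sketched as a short loop: the agent observes $`s_t`$, picks $`a_t`$, and the environment returns $`(s_{t+1}, r_t)`$. The toy 1-D environment and random policy below are purely illustrative:

```python
# Minimal agent-environment interaction loop on a toy 1-D environment.
# The environment and the random policy are purely illustrative.
import random

class LineWorld:
    """States 0..4; the agent starts at 0 and is rewarded upon reaching 4."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):                        # a in {-1, +1}
        self.s = max(0, min(4, self.s + a))
        reward = 1.0 if self.s == 4 else 0.0
        done = self.s == 4
        return self.s, reward, done

def random_policy(s):
    return random.choice((-1, 1))

env = LineWorld()
s = env.reset()
for t in range(100):                          # one episode, capped at 100 steps
    a = random_policy(s)                      # agent acts...
    s, r, done = env.step(a)                  # ...environment responds with s', r
    if done:
        break
```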
``` math
V_\pi(s_t) = \mathbb E_{a_t\sim \pi(\bullet \vert s_t)} [Q_\pi (s_t, a_t)],
```

inducing an ordering over states and state-action pairs under $`\pi`$; value functions are thus central to most RL algorithms. A variety of algorithms have been developed in RL attempting to find (approximate) solutions to the problem of maximizing cumulative reward (we report some in Figure [rl-algos-atlas]).
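The identity $`V_\pi(s) = \mathbb E_{a \sim \pi}[Q_\pi(s, a)]`$ can be checked numerically for a toy state with two actions; the values and probabilities below are made up for illustration:

```python
# Numerically checking V_pi(s) = E_{a ~ pi(.|s)}[Q_pi(s, a)] for one toy state.
# Q-values and action probabilities are made up for illustration.
Q = {"left": 1.0, "right": 3.0}          # Q_pi(s, a)
pi = {"left": 0.25, "right": 0.75}       # pi(a | s), sums to 1

V = sum(pi[a] * Q[a] for a in Q)         # expectation over actions
print(V)  # 2.5
```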

<span id="rl-algos-atlas" style="position: absolute;"></span>

Popular RL algorithms. See @SpinningUp2018 for a complete list of citations.
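As a concrete instance of the simplest end of this family, tabular Q-learning can be sketched on a toy 5-state chain; the environment and hyperparameters are illustrative:

```python
# Sketch of tabular Q-learning on a toy 5-state chain (move left/right,
# reward only at the far end). Environment and hyperparameters are illustrative.
import random

N_STATES, ACTIONS = 5, (-1, 1)
alpha, gamma, eps = 0.5, 0.9, 0.3
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def env_step(s, a):
    s2 = max(0, min(N_STATES - 1, s + a))
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r, s2 == N_STATES - 1

random.seed(0)
for _ in range(500):                              # training episodes
    s = random.randrange(N_STATES - 1)            # random non-terminal start
    for _ in range(50):
        if random.random() < eps:                 # epsilon-greedy exploration
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = env_step(s, a)
        # TD update: Q <- Q + alpha * (r + gamma * max_a' Q(s', a') - Q)
        bootstrap = 0.0 if done else max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])
        s = s2
        if done:
            break

greedy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)]
# greedy typically ends up as [1, 1, 1, 1]: move right from every state
```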
Popular approaches to continuous state and action spaces--such as those studied w...

Streamlined end-to-end control pipelines, data-driven feature extraction, and a disregard for explicit modeling in favor of interaction data are all features of RL for robotics. However, RL still suffers from limitations concerning safety and learning efficiency, which are particularly pressing for real-world robotics applications.

First, especially early in training, *actions are typically explorative, and thus may be erratic*. On physical systems, untrained policies may command high velocities, self-colliding configurations, or torques exceeding joint limits, leading to wear and potential hardware damage. Mitigating these risks requires external safeguards (e.g., watchdogs, safety monitors, emergency stops), often incurring a high degree of human supervision. Further, in the typical episodic setting considered in most robotics problems, experimentation is substantially slowed down by the need to manually reset the environment over the course of training, a time-consuming and error-prone process. Second, learning efficiently remains problematic in RL, *limiting the applicability of RL in real-world robotics due to consequently prohibitive training timescales*. Even strong algorithms such as SAC @haarnojaSoftActorCriticOffPolicy2018 typically require a large number of transitions $`\{ (s_t, a_t, r_t, s_{t+1})\}_{t=1}^N`$. On real-world hardware, generating this data is time-consuming.

<span id="synthetic-vs-real-duck" style="position: absolute;"></span>

Simulated (left) vs. real-world (right) OpenDuck. Discrepancies in the simulation dynamics (reality gap) pose risks to policy transfer.
Training RL policies in simulation @tobinDomainRandomizationTransferring2017 addresses both issues, eliminating physical risk and dramatically increasing throughput. Yet, simulators require significant modeling effort, and rely on assumptions (simplified physical modeling, instantaneous actuation, static environmental conditions, etc.) limiting the transferability of policies learned in simulation, due to the discrepancy between real and simulated environments (*reality gap*, Figure [synthetic-vs-real-duck]). *Domain randomization* (DR) @tobinDomainRandomizationTransferring2017 is a popular technique to overcome the reality gap, and consists in randomizing the parameters of the simulated environment during training, aiming to induce robustness to specific disturbances. DR is typically employed to increase the diversity of scenarios over the course of training, improving the performance of sim-to-real transferred policies @akkayaSolvingRubiksCube2019, @antonovaReinforcementLearningPivoting2017, @jiDribbleBotDynamicLegged2023. In practice, DR is performed by training in simulation on simulated dynamics $`\mathcal D`$, further parametrized as $`\mathcal D \equiv \mathcal D_\xi`$, with a *dynamics* (random) vector $`\xi`$ drawn from an arbitrary distribution, $`\xi \sim \Xi`$. For instance, one could decide to randomize the friction coefficient of the surface in a locomotion task (Figure [ducks-on-terrains]), or the center of mass of an object in a manipulation task. Over the course of training--typically at each episode's reset--a new $`\xi`$ is drawn and used to specify the environment's dynamics for that episode.

<span id="ducks-on-terrains" style="position: absolute;"></span>

The same locomotion task can be carried out in different (simulated) domains (exemplified by the difference in terrains) at training time, resulting in increased robustness over diverse environment dynamics.
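The per-episode sampling $`\xi \sim \Xi`$ described above can be sketched in a few lines; the parameter names and their ranges are illustrative, not taken from any real simulator:

```python
# Sketch of domain randomization: at each episode reset, draw a dynamics
# vector xi ~ Xi (here, friction and a mass scale from uniform ranges) and
# instantiate that episode's dynamics D_xi with it. Ranges are illustrative.
import random

XI_RANGES = {"friction": (0.5, 1.5), "mass_scale": (0.8, 1.2)}

def sample_xi(rng):
    """Draw one dynamics vector xi from the (uniform) distribution Xi."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in XI_RANGES.items()}

def run_episode(xi):
    # stand-in for rolling out the policy under dynamics D_xi
    return {"xi": xi, "return": 0.0}

rng = random.Random(0)
for episode in range(3):
    xi = sample_xi(rng)        # new dynamics at each episode reset
    result = run_episode(xi)
```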
While effective in transferring policies across the reality gap in real-world robotics...

Selecting the dynamics distribution $`\Xi`$ is also non-trivial. On the one hand, distributions with low entropy risk causing failure at transfer time, due to the limited robustness induced over the course of training. On the other hand, excessive randomization may cause over-regularization and hinder performance @margolisRapidLocomotionReinforcement2022. Consequently, the research community has investigated approaches to automatically select the randomization distribution $`\Xi`$, using signals from the training process or tuning it to reproduce observed real-world trajectories. @akkayaSolvingRubiksCube2019 use a parametric uniform distribution $`\mathcal U(a, b)`$ as $`\Xi`$, widening the bounds $`a, b`$ as training progresses and the agent's performance improves (AutoDR). While effective, AutoDR requires significant tuning--the bounds are widened by a fixed, pre-specified amount $`\Delta`$--and may disregard data when performance *does not* improve after a distribution update @tiboniDomainRandomizationEntropy2024. @tiboniDomainRandomizationEntropy2024 propose a method similar to AutoDR (DORAEMON) to evolve $`\Xi`$ based on the training signal, but with the key difference of explicitly maximizing the entropy of a parametric Beta distribution--inherently more flexible than uniform distributions--with learned updates instead of a fixed $`\Delta`$. In this, DORAEMON proves particularly effective at dynamically increasing the entropy of the training distribution by employing an outer-loop max-entropy objective, tackled under performance constraints in the inner-loop RL problem. Other approaches to automatically performing DR consist in specifically tuning $`\Xi`$ to align the simulation and real-world domains as much as possible.
For instance, @chebotarClosingSimtorealLoop2019 interleave in-simulation policy training with repeated real-world policy rollouts used to adjust $`\Xi`$ based on real-world data, while @tiboniDROPOSimtoRealTransfer2023 leverage a single, pre-collected set of real-world trajectories and tune $`\Xi`$ under a simple likelihood objective. While DR has shown promise, it does not address the main limitation that, even under the assumption that an ideal distribution $`\Xi`$ were available, many robotics problems *cannot be simulated with high-enough fidelity under practical computational constraints*. Simulating contact-rich manipulation of possibly deformable or soft materials--e.g., *folding a piece of clothing*--can prove time-intensive, limiting the benefits of in-simulation training. A perhaps more fundamental limitation of RL for robotics is the general unavailability of *dense* reward functions for complicated tasks, the design of which is essentially based on human expertise, ingenuity and trial-and-error. In practice, *sparse* reward functions can be used to conclude whether one specific goal has been attained--*has this t-shirt been correctly folded?*--but unfortunately make learning more challenging. As a result, despite notable successes, deploying RL directly on real-world robots at scale remains challenging.
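The AutoDR-style bound-widening rule described above can be sketched in a few lines. This is a minimal illustration with hypothetical names, threshold, and step size--not the original implementation: the bounds of a uniform randomization distribution $`\mathcal U(a,b)`$ are widened by a fixed step whenever training performance clears a threshold.

```python
# Minimal AutoDR-style sketch (hypothetical names/values): widen the bounds of a
# uniform randomization distribution U(a, b) by a fixed step `delta` whenever
# the agent's performance exceeds a threshold; otherwise leave them unchanged.
def autodr_update(a, b, performance, threshold=0.8, delta=0.05):
    """Widen the randomization interval [a, b] if training performance is high."""
    if performance >= threshold:
        a -= delta  # widen the lower bound
        b += delta  # widen the upper bound
    return a, b

# High performance widens the interval; low performance leaves it intact.
a, b = autodr_update(0.9, 1.1, performance=0.95)
a2, b2 = autodr_update(0.9, 1.1, performance=0.5)
```

DORAEMON replaces the fixed `delta` with learned updates of a Beta distribution under an explicit max-entropy objective, as discussed above.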
Soft Actor-Critic (SAC) @haarnojaSoftActorCriticOffPolicy2018 is a derivation of DDPG in the max-entropy (MaxEnt) RL framework, in which RL agents are tasked with *maximizing the discounted cumulative reward while acting as randomly as possible*. MaxEnt RL @haarnojaReinforcementLearningDeep2017b has proven particularly robust thanks to the development of diverse behaviors, incentivized by its entropy-regularized formulation. In that, MaxEnt revisits the RL objective $`J (\pi)`$ to specifically account for the policy entropy $`\mathcal H(\pi (\bullet \vert s_t))`$,

``` math
J(\pi) = \sum_{t} \mathbb E_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \mathcal H(\pi(\bullet \vert s_t)) \right]
```

Lastly, in order to improve the robustness of their approach to different goals while maintaining practical scalability, @luoSERLSoftwareSuite2025 introduced a modified state and action space, expressing proprioceptive configurations $`q`$ and actions $`\dot q`$ in the frame of the end-effector pose at $`t=0`$. By randomizing the initial pose of the end-effector ($`s_0`$), @luoSERLSoftwareSuite2025 achieved a result similar to manually randomizing the environment at every timestep, with the benefit of keeping the environment in the same condition across multiple training episodes, making their method markedly more practical and scalable.
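To make the entropy-regularized objective concrete, here is a toy numerical sketch (illustrative names only; a real SAC implementation estimates these quantities with neural critics): the per-step reward is augmented with $`\alpha`$ times the entropy of a Gaussian policy.

```python
# Toy illustration of the MaxEnt RL objective: each per-step reward is augmented
# with alpha * H(pi(.|s_t)), here using the closed-form entropy of a 1-D Gaussian.
import math

def gaussian_entropy(sigma):
    """Differential entropy of a 1-D Gaussian policy N(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def maxent_return(rewards, sigmas, alpha=0.2, gamma=0.99):
    """Discounted return with a per-step entropy bonus alpha * H(pi(.|s_t))."""
    return sum(gamma ** t * (r + alpha * gaussian_entropy(s))
               for t, (r, s) in enumerate(zip(rewards, sigmas)))

# Two identical steps with unit reward and policy std 0.5.
g = maxent_return([1.0, 1.0], [0.5, 0.5])
```

Wider (higher-entropy) policies receive a larger bonus, which is exactly the mechanism that incentivizes diverse behaviors during training.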

<span id="hil-serl-blocks" style="position: absolute;"></span>

(A) HIL-SERL allows for real-world training of high-performance RL agents by building on top of the advancements introduced by SAC, RLPD and SERL. (B) Example of human intervention during a HIL-SERL training process on a real-world SO-100.
#### Code Example: Real-world RL

<span id="ch3-hil-serl-architecture" style="position: absolute;"></span>

HIL-SERL is a SOTA RL algorithm for training control policies directly in the real world. Its implementation in lerobot relies on a decoupled actor-learner architecture, communicating across processes (and possibly networks) with queues used to share (1) transitions $`(s_t, a_t, r_t, s_{t+1})`$ and (2) parameters $`\theta`$.
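The decoupled actor-learner pattern can be sketched with standard Python queues. This is a simplified stand-in for lerobot's multi-process implementation; the "transition", "gradient step", and parameter values are all placeholders.

```python
# Sketch of a decoupled actor-learner loop: the actor streams transitions
# (s_t, a_t, r_t, s_{t+1}) to the learner through one queue, and the learner
# periodically pushes updated parameters theta back through another queue.
import queue
import threading

transitions, parameters = queue.Queue(), queue.Queue()

def actor(n_steps):
    theta = 0.0  # placeholder policy parameters
    for t in range(n_steps):
        try:
            theta = parameters.get_nowait()  # refresh policy if new params arrived
        except queue.Empty:
            pass
        transitions.put((t, "a", 1.0, t + 1))  # dummy transition (s, a, r, s')

def learner(n_updates):
    theta = 0.0
    for _ in range(n_updates):
        s, a, r, s_next = transitions.get()  # consume the transition stream
        theta += 0.01 * r                    # placeholder "gradient" step
        parameters.put(theta)                # share fresh parameters

t1 = threading.Thread(target=actor, args=(10,))
t2 = threading.Thread(target=learner, args=(10,))
t1.start(); t2.start(); t1.join(); t2.join()
```

In the real system the two sides typically live in separate processes (or machines), with the learner running on GPU hardware.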

<span id="ch4-bc-trajectories" style="position: absolute;"></span>

(A) Average (with standard deviation) evolution of the actuation levels over the first 5 recorded episodes in lerobot/svla_so101_pickplace. Proprioceptive states prove invaluable in determining the robot’s state during an episode. (B) Camera frames are also recorded alongside measurements of the robot’s state, capturing information about the robot’s interaction with its environment.
Learning from human demonstrations provides a pragmatic alternative to the RL pipeline discussed in Section [learning-rl]. Indeed, especially in real-world robotics, online exploration is typically *costly and potentially unsafe*, and designing (dense) reward signals is a *brittle and task-specific* process. Further, even success detection itself often requires bespoke instrumentation, while episodic training demands reliable resets--all factors complicating training RL algorithms on hardware at scale. Behavioral Cloning (BC) sidesteps these constraints by *casting control as an imitation learning problem*, leveraging previously collected expert demonstrations to anchor the learned autonomous behavior. Most notably, by *learning-to-imitate*, autonomous systems naturally adhere to the objectives, preferences, and success criteria implicitly encoded in the data, which reduces early-stage exploratory failures and obviates hand-crafted reward shaping altogether.
Formally, let $`\mathcal D = \{ \tau^{(i)} \}_{i=1}^N`$ be a set of expert trajectories, with $`\tau^{(i)} = \{(o_t^{(i)}, a_t^{(i)})\}_{t=0}^{T_i}`$ representing the $`i`$-th length-$`T_i`$ trajectory in $`\mathcal D`$, $`o_t \in \mathcal O`$ denoting observations (e.g., images and proprioception altogether), and $`a_t \in \mathcal A`$ the expert actions. Typically, observations $`o \in \mathcal O`$ consist of both image and proprioceptive information, while actions $`a \in \mathcal A`$ represent control specifications for the robot to execute, e.g. a joint configuration. Note that, differently from Section [learning-rl], in the imitation learning context $`\mathcal D`$ denotes an offline dataset collecting $`N`$ reward-free (expert) human trajectories, and *not* the environment dynamics. Similarly, in this section $`\tau^{(i)}`$ represents a length-$`T_i`$ trajectory of observation-action pairs, which crucially *omits any reward information entirely*. Figure [ch4-bc-trajectories] graphically shows trajectories in terms of the average evolution of the actuation of the 6 joints of a teleoperated SO-100 manipulator. Notice how proprioceptive states are captured jointly with camera frames over the course of the recorded episodes, providing a unified, high-frame-rate collection of both image and joint teleoperation data. Figure [ch4-observation-action-mapping] shows $`(o_t, a_t)`$-pairs for the same dataset, with the actions performed by the human expert illustrated alongside the corresponding observation. In principle, (expert) trajectories $`\tau^{(i)}`$ can have different lengths, since demonstrations might exhibit multi-modal strategies to attain the same goal, resulting in multiple, different behaviors.
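As a toy instance of the BC setup above, one can regress actions from observations by least squares on scalar data. This is purely illustrative--real policies are deep networks over images and proprioception--but it shows the core supervised-regression structure of BC.

```python
# Toy behavioral cloning: fit a policy pi(o) ~= a by stochastic gradient descent
# on the squared loss over a dataset of (observation, action) pairs, where the
# "expert" action is a noisy linear function of the observation.
import random

random.seed(0)
dataset = [(o, 2.0 * o + random.gauss(0.0, 0.01))
           for o in [0.1 * i for i in range(50)]]

w = 0.0    # scalar policy parameter: pi(o) = w * o
lr = 0.01  # learning rate
for epoch in range(200):
    for o, a in dataset:
        err = w * o - a   # prediction error on this (o, a) pair
        w -= lr * err * o # gradient step on 0.5 * err**2

# w should now be close to the expert gain of 2.0
```

The same structure--minimize a regression loss between predicted and demonstrated actions--carries over to the ACT and Diffusion Policy models discussed later, only with far richer function classes.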

<span id="ch4-observation-action-mapping" style="position: absolute;"></span>

Sample observation and action pairs over the course of a given trajectory recorded in lerobot/svla_so101_pickplace. Observations, comprising both proprioceptive and visual information, are recorded alongside the configuration of a second, leader robot controlled by a human expert, providing complete information for regressing actions given observations.
Despite the inherent challenges of learning from non-i.i.d. data, the BC formulation presents several operational advantages in robotics. First, training happens offline and naturally accommodates expert demonstration data, thereby severely limiting exploration risks: anchoring actions in imitation prevents the robot from performing dangerous actions altogether. Second, reward design is entirely unnecessary in BC, as demonstrations already reflect human intent. The absence of rewards also prevents the risk of misalignment and specification gaming (*reward hacking*), otherwise inherent in purely reward-based RL @heessEmergenceLocomotionBehaviours2017. Third, because expert trajectories encode terminal conditions, success detection and resets are implicit in the dataset. Finally, empirical evidence suggests the performance of BC scales naturally with growing corpora of demonstrations collected across tasks, embodiments, and environments. Nonetheless, BC can, in principle, only reproduce behaviors that are at best as good as those of the demonstrator, and therefore offers no remedy for the suboptimal decisions that humans may enact. This limitation is particularly problematic in sequential decision-making tasks where expert demonstrations are scarce--either because data collection is costly or because human performance is inherently suboptimal. Yet, many robotics applications still benefit from relatively inexpensive pipelines for collecting high-quality human-generated trajectories, justifying the use of BC in such settings.

<span id="ch4-issues-with-bc" style="position: absolute;"></span>

Point-wise policies suffer from limitations due to (A) covariate shift and (B) poor approximation of multimodal demonstrations. (A) Small errors may drive the policy out of distribution, triggering a vicious circle that ultimately results in failure. (B) Both modes of reaching for a target object in the scene--either left-first or right-first--are equally good and thus equally likely to be present in a dataset of human demonstrations, ultimately resulting in multimodal demonstrations.
#### Variational Auto-Encoders

<span id="ch4-task-effect-on-pairs" style="position: absolute;"></span>

Intuitively, the latent variable in a single-latent model may contain information regarding the task being performed, which directly results in the likelihood of the same observation-action pair differing between two different tasks. When (A) picking a block, the likelihood of a wide gripper opening should be higher than that of a narrower one, while the opposite should hold when (B) pushing the block.
Intuitively, in the case of observation-action pairs $`(o, a)`$ for a robotics application, $`z`$ could be interpreted as some high-level representation of the underlying task being performed by the human demonstrator. In such a case, treating $`p(o,a)`$ as a marginalization over $`\operatorname{supp}({Z})`$ of the complete joint distribution $`p(o,a,z)`$ natively captures the effect different tasks have on the likelihood of observation-action pairs. Figure [ch4-task-effect-on-pairs] graphically illustrates this concept in the case of a (A) picking and (B) pushing task, for which, nearing the target object, the likelihood of actions resulting in opening the gripper--the higher $`q_6`$, the wider the gripper’s opening--should intuitively be (A) high or (B) low, depending on the task performed. While the latent space $`Z`$ typically has a much richer structure than the set of all actual tasks performed, eq. [BC-latent-variable] still provides a solid framework to learn joint distributions conditioned on unobservable yet relevant factors. Figure [ch4-latent-variable-model] represents this latent-variable framework in the context of a robotics application: the true, $`z`$-conditioned generative process assigns *likelihood* $`p((o,a) \vert z)`$ to the single $`(o,a)`$-pair. Using Bayes’ theorem, one can reconstruct the *posterior* distribution on $`\operatorname{supp}({Z})`$, $`q_\theta(z \vert o,a)`$, from the likelihood $`p_\theta(o,a \vert z)`$, *prior* $`p_\theta(z)`$ and *evidence* $`p_\theta(o,a)`$. VAEs approximate the latent variable model presented in eq. [BC-latent-variable] using an *approximate posterior* $`q_\phi(z \vert o,a)`$ while regressing parameters for a parametric likelihood, $`p_\theta(o,a \vert z)`$ (Figure [ch4-latent-variable-model]).
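For reference, the resulting evidence lower bound (ELBO) on observation-action pairs takes the standard form, written here with the same symbols used above (approximate posterior $`q_\phi`$, likelihood $`p_\theta`$ and prior $`p_\theta(z)`$):

``` math
\log p_\theta(o, a) \geq \mathbb E_{z \sim q_\phi(z \vert o, a)} \left[ \log p_\theta(o, a \vert z) \right] - D_{\mathrm{KL}}\left( q_\phi(z \vert o, a) \,\Vert\, p_\theta(z) \right)
```

Maximizing the right-hand side jointly trains the encoder $`q_\phi`$ and decoder $`p_\theta`$, trading off reconstruction quality against closeness of the posterior to the prior.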

<span id="ch4-latent-variable-model" style="position: absolute;"></span>

(A) The latent variable model in a robotics application regulates influence between observed (o, a) variables and an unobservable latent variable. (B) VAEs approximate exact latent variable models by means of variational inference.
where we explicitly showed the marginalization over the multiple latents in eq. [BC-multi-latent-model-1], and used the law of conditional probability and the Markov property in eq. [BC-multi-latent-model-2]. Also, for ease of notation, we will refer to observation-action pairs $`(o,a)`$ as $`z_0`$.

<span id="ch4-many-latents" style="position: absolute;"></span>

HMLV models posit the data generation process is influenced by a stack of Markov-dependent latent variables, with samples from the posterior distribution being progressively higher up in the hierarchy.
where the former term is equivalent to the reconstruction term in eq. [VAE-min-neg-ELBO] and the latter term can be obtained in closed form.

<span id="diffusion-robot-actions" style="position: absolute;"></span>

DMs iteratively corrupt samples (left) from an unknown distribution into a quasi-standard Gaussian (center), learning the displacement field (right) that allows reconstructing samples from the unknown target distribution by iteratively denoising samples of a tractable, easy-to-sample distribution.
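The forward (noising) process and the noise-prediction loss can be sketched on scalar data as follows. This is a toy illustration of DDPM-style corruption, assuming a linear $`\beta`$-schedule and a trivial placeholder "regressor"; real models predict the noise with a neural network.

```python
# Toy DDPM-style forward process: corrupt a clean scalar sample z0 to rank t via
# z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps, with abar_t the cumulative
# product of (1 - beta_t) along a linear noise schedule.
import math
import random

random.seed(0)
T = 50
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # linear schedule
alphas_bar, prod = [], 1.0
for b in betas:
    prod *= (1.0 - b)
    alphas_bar.append(prod)  # cumulative product abar_t, decreasing in t

def noise_sample(z0, t):
    """Return the corrupted sample z_t and the noise eps used to produce it."""
    eps = random.gauss(0.0, 1.0)
    z_t = math.sqrt(alphas_bar[t]) * z0 + math.sqrt(1.0 - alphas_bar[t]) * eps
    return z_t, eps

# Training target: regress eps from (z_t, t). Here a placeholder regressor that
# always predicts zero noise, so the loss is just the squared true noise.
z_t, eps = noise_sample(z0=1.0, t=T - 1)
loss = (eps - 0.0) ** 2
```

At inference, the learned regressor is applied repeatedly to walk a Gaussian sample back toward the data distribution, which is the denoising direction shown in the figure.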

<span id="ch4-action-vs-observation-distribution" style="position: absolute;"></span>

A joint action-observation distribution, in the simplified case where the observation is the elbow-flex actuation in a SO-100, and the action is the recorded position for the same joint from the teleoperator arm. Because the motion was recorded via teleoperation, the points distribute along the diagonal.
Conditional vector fields are defined not only over their argument $`z`$ and time $`t`$, but also vary with respect to an auxiliary variable $`z_0`$, thereby extending the standard notion of a vector field to incorporate additional conditioning. Note that the traditional discrete-time noise schedule $`\{\beta_t\}_{t=0}^T`$ is now generalized to a continuous map $`\beta : [0,1] \mapsto \mathbb R^+`$. Crucially, @lipmanFlowMatchingGenerative2023 prove that by exclusively optimizing the vector field for individual data points $`z_0 \in \mathcal D`$, one also retrieves the optimal flow to morph the entire support of the initial distribution $`p_0`$ into $`p_1 \ \text{s.t.} \ \mathcal D \sim p_1`$.

<span id="ch4-normalizing-flows" style="position: absolute;"></span>

Probability distributions can be modified differently by applying different vector fields, inducing different flows of mass across the same support (top versus bottom, using two different time-invariant 2D fields $u_1(x, y) = (x, 0)$ and $u_2(x,y) = (x/\sqrt{2}, y/\sqrt{2})$). Notice time flows continuously in [0, 1]. FM models learn to approximate a target vector field, thereby producing arbitrary (goal) transformations of an easy-to-sample initial distribution.
@@ -1174,7 +1174,7 @@ Conditional vector fields are defined not only over their argument $`z`$ and tim While the noising schedule of DMs results in a stochastic resembling a random (Brownian) walk, FM allows for more general--potentially, deterministic--likelihood and posterior parametrization. In the FM literature the likelihood and posterior probabilty densities defined along a HMLV model are typically referred to as a *probability path*, where the distributions for successive adjacent transitions in the HMLV model are related by the (normalized) flow between them (Figure 
[ch4-normalizing-flows]). The inherent flexibility of FM is one of its key advantages over DMs, as it opens up the possibility of *learning* more efficient paths. For instance, one can design probability paths inspired by Optimal Transport (OT), a mathematical framework concerned with characterizing the most efficient morphings between probability distributions. Probability paths obtained through OT tend to be *straighter* than diffusion paths (Figure [ch4-diffusion-paths-versus-fm]), which can lead to faster and more stable training, and empirically results in higher-quality generations with fewer denoising steps at inference time. In particular, by avoiding the unnecessary backtracking associated with the inherently stochastic nature of both the noising and denoising processes in DMs, test-time compute is typically significantly reduced in FM, while retaining comparable results @lipmanFlowMatchingGenerative2023.
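The straight (OT-style) conditional path discussed above admits a particularly simple training target, sketched below on scalars. This is an illustrative sketch of conditional flow matching, using the convention (consistent with the text) that $`t=0`$ corresponds to the easy-to-sample source $`p_0`$ and $`t=1`$ to the data distribution $`p_1`$.

```python
# Conditional flow matching with the straight (optimal-transport) path: along
# z_t = (1 - t) * z0 + t * z1, the target velocity is constant and equal to
# z1 - z0. A network would be regressed onto this target at random (z0, z1, t).
import random

random.seed(0)

def cfm_pair(z0, z1, t):
    """Return the interpolated point z_t and the target velocity on the OT path."""
    z_t = (1.0 - t) * z0 + t * z1
    v_target = z1 - z0  # constant along the straight path
    return z_t, v_target

z0 = random.gauss(0.0, 1.0)  # sample from the easy-to-sample source p_0
z1 = 2.0                     # a "data" sample from the target p_1
z_t, v = cfm_pair(z0, z1, t=0.5)
```

The straightness of the target velocity is precisely why OT-style paths need fewer integration steps at inference than the curved, stochastic diffusion paths.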

<span id="ch4-diffusion-paths-versus-fm" style="position: absolute;"></span>

Compared to diffusion, flow matching distorts the distribution along a less random pattern, resulting in a clearer interpolation between source and target distributions. The visualization shows an example comparison between these two methods on the joint distribution of robot observations and actions over T = 50 steps.
In ACT (Figure [ch4-act]), inference for a given observation $`o \in \mathcal O`$ could be performed by (1) defining a prior $`p_\omega(z \vert o)`$ for the latent variable $`z`$ and (2) decoding an action chunk from a sampled latent $`z \sim p_\omega(\bullet \vert o)`$, similarly to how sampling from standard VAEs takes place, with the exception that vanilla VAEs typically pose $`p(z\vert o) \equiv p(z) \sim \mathcal N(\mathbf{0}, \mathbf{I})`$ and thus skip (1).

<span id="ch4-act-encoder" style="position: absolute;"></span>

The CVAE encoder used in ACT. Input action chunks are first embedded and aggregated with positional embeddings, then processed alongside embedded proprioceptive information and a learned [CLS] token, which aggregates input-level information to predict the style variable z. The encoder is exclusively used to train the decoder, and is entirely disregarded at inference time.

<span id="ch4-act-decoder" style="position: absolute;"></span>

The CVAE decoder used in ACT, comprising a full encoder-decoder Transformer architecture. Camera observations from all n camera views are first embedded using pre-trained visual encoders, and then aggregated with the corresponding positional embeddings. Then, the proprioceptive information and the style variable z retrieved from the CVAE encoder are fed to the encoder-decoder Transformer for inference. The encoder shares the matrices K, V with the decoder, and is trained to decode fixed position embeddings into action chunks.
#### Code Example: Training and Using ACT in Practice

<span id="ch4-act" style="position: absolute;"></span>

Action Chunking with Transformers (ACT), as in @zhaoLearningFineGrainedBimanual2023. ACT introduces an action chunking paradigm to cope with high-dimensional multi-modal demonstration data, and a transformer-based CVAE architecture.
Note how in eq. 
[diffusion-policy-objective] the noise regressor is conditioned on both the latent variable rank $`t`$ *and* a stack of previous observations $`o_{t-H_o:t}`$. @chiDiffusionPolicyVisuomotor2024 claim the combination of (1) conditioning on a horizon of previous observations and (2) predicting multiple actions into the future allows DP to *commit to specific modes* in the data at inference time, which proves essential for good performance and for avoiding indecisiveness.

<span id="diffusion-policy-architecture" style="position: absolute;"></span>

The Diffusion Policy architecture, as in @chiDiffusionPolicyVisuomotor2024. A stack of Ho previous observations is used as external conditioning to denoise a group of Ha actions. Conditioning is performed at every layer of a U-Net block. Diffusion Policy obtains fully-formed action chunks with as few as T = 10 denoising steps.

<span id="ch4-async-inference" style="position: absolute;"></span>

Asynchronous inference. Illustration of the asynchronous inference stack. Note that the policy can run on a remote server, possibly equipped with GPUs.

<span id="ch4-queues" style="position: absolute;"></span>

Action queue size evolution at runtime for various levels of g when (A) not filtering observations based on joint-space similarity and (B) filtering out near-duplicate observations, measuring their similarity in joint space.
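A near-duplicate filter of the kind used in panel (B) can be sketched as follows. Names and the threshold are hypothetical; the idea is simply to skip observations whose joint configuration barely moved since the last one that was processed.

```python
# Sketch of near-duplicate observation filtering in joint space: only forward a
# new observation if its Euclidean distance from the last processed one exceeds
# a threshold eps, which keeps the action queue from filling with redundant work.
import math

def should_process(q_new, q_last, eps=0.05):
    """Process q_new only if it moved at least eps away from q_last."""
    if q_last is None:
        return True
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q_new, q_last)))
    return dist > eps

q_last, processed = None, []
for q in [(0.0, 0.0), (0.01, 0.0), (0.2, 0.1), (0.2, 0.1)]:
    if should_process(q, q_last):
        processed.append(q)
        q_last = q
# Only the first observation and the genuinely new pose are processed.
```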
The advent of large models trained on internet-scale datasets has drastically influenced fields like Computer Vision (CV) and Natural Language Processing (NLP), shifting the previously task-specific paradigm towards combining (1) an initial, task-agnostic, large-scale pre-training stage with (2) a task-specific adjustment phase. This *pre-train-and-adapt* paradigm has now largely replaced more classic approaches consisting of task-specific data collection, curation and model training in many subdomains within CV and NLP, motivated by the limited scalability of *task-specific approaches*, which have traditionally been more labor-intensive. Factors including (1) the advancements in generalist models learned with self-supervision for perception @oquabDINOv2LearningRobust2024 or semantic understanding @devlinBERTPretrainingDeep2019 and (2) the popularization of collective efforts to aggregate large-scale, openly available datasets @oneillOpenXEmbodimentRobotic2025, @khazatskyDROIDLargeScaleInTheWild2025 are increasingly pushing the field of robot learning towards the pre-train-and-adapt paradigm. This shift taps into the long-standing challenge of developing generalist robot policies, and holds the promise of surpassing traditionally siloed approaches to robotics problems to develop a *foundation robotics model*. While Section 
[learning-imitation] introduced methods for learning *single-task policies* such as ACT or Diffusion Policy, in this section we present advancements in developing *generalist, multi-task policies*, capable of performing a wide range of tasks across different environments and embodiments, and guided by unstructured instructions typically given in plain, natural language.

<span id="ch5-ml-vs-robotics-foundation" style="position: absolute;"></span>

Fields within ML such as Computer Vision and NLP converged on the development of foundation models, trained on a variety of large-scale datasets and capable of performing multiple downstream tasks (top). Conversely, robotics suffered from limited standardization in terms of the architectures used, and from siloed, task-specific datasets, resulting in a high degree of fragmentation which traditionally hindered the development of generalist models for robotics in favour of task-specific models (bottom).
The remarkable success of foundation models in NLP and CV seems to be increasingly predicated on two core principles: architectural innovation and (joint) data-compute scaling. Indeed, the transformer architecture proved very effective in capturing long-range dependencies in a variety of data formats, and its stability and expressivity made it the *de facto* standard for modern large-scale models trained on internet-scale datasets. However, in stark contrast with large-scale NLP and CV datasets @raffelExploringLimitsTransfer2023, @ImageNet_VSS09, robotics has historically developed around small, task-specific datasets. In turn, this traditionally hindered scalability across problems as well as results, posing concrete challenges to developing general-purpose robot learning algorithms. Indeed, differently from the wealth of relatively readily-available, task-agnostic text and image datasets on the internet, robotics data is *intrinsically embodied* and thus task-specific: datasets collected for *manipulation* differ significantly from those for *locomotion*. In particular, since each expert trajectory is tied to a specific robot platform and to the operating conditions of its environment and task, data heterogeneity has long posed a *methodological* challenge for scaling robotics datasets via aggregation. Further, datasets consisting of expert demonstrations are (1) intrinsically more expensive to collect and (2) notoriously heterogeneous--different human experts may perform the same task in very different ways. Beyond this, heterogeneity also raises *conceptual* issues: naively mixing data across embodiments can induce negative transfer, as control strategies developed in isolation for different robot systems in different environments may even conflict when combined. 
Thus, the high degree of fragmentation of robotics datasets and tasks has traditionally led to the development of *specialist* policies, trained on small, task-specific datasets, developed to perform well at their designated task but that fail to generalize to new deployment scenarios (Figure [ch5-ml-vs-robotics-foundation]).

<span id="ch5-generalist-policies-timeline" style="position: absolute;"></span>

Early efforts in the development of generalist models for robotics include BC-Z @jangBCZZeroShotTask2022, RT-1 @brohanRT1RoboticsTransformer2023, and RT-2 @brohanRT2VisionLanguageActionModels2023: large-scale models trained on thousands of demonstrations. The open release of the Open-X @oneillOpenXEmbodimentRobotic2025 and DROID @khazatskyDROIDLargeScaleInTheWild2025 datasets fostered the development of open-source models: OpenVLA @kimOpenVLAOpenSourceVisionLanguageAction2024, π0 @blackp0VisionLanguageActionFlow2024 and SmolVLA @shukorSmolVLAVisionLanguageActionModel2025.
@@ -1723,12 +1723,12 @@ Driven by the goal of developing generalist robot policies, the research communi In a follow-up work, the same group of authors propose a modified method to learn generalist models, leveraging (1) a more powerful architecture and (2) scaling up the dataset used . In RT-2, @brohanRT2VisionLanguageActionModels2023 propose inheriting internet-scale semantic knowledge from large-scale multi-modal datasets to learn a single, *unified model* for robotics control. Such a model, termed *Vision-Language-Action* (VLA) in the original RT-2 paper, effectively casts robot control as a language-modeling problem, and in particular as a Visual Question-Answering (VQ) task, in which the output token space used to represent *textual tokens* is shared with the *8-bits tokens* used to represent the 256 ($`2^8`$) actuation levels of a 6-dof robot. In their work, @brohanRT2VisionLanguageActionModels2023 propose co-fine-tuning large-scale VLMs such as PaLIX @chenPaLIXScalingMultilingual2023 or PaLM-E @driessPaLMEEmbodiedMultimodal2023 on a mix of (1) web and (2) robotics data, complementing VQtraining with robotics-specific signal, and learning to directly output robot actions in a shared token space for visual and language inputs. In their work, the authors claim using large models trained on internet-scale data as backbones for VLAs allows models to tap into the rich semantic knowledge embedded in the VLM’s parameters, interpreting instructions and unseen objects by connecting them to concepts acquired while pre-training. For instance, @brohanRT2VisionLanguageActionModels2023 show that while RT-2 has never been explicitly trained to repurpose tools for a *hammering* task, it can still combine its semantic understanding of images, so that when asked which object between (1) a piece of paper, (2) a pair of headphones or (3) a rock may be used instead of a hammer, it correctly answers (3). 
Traditionally, research efforts revolved around not only training models, but also proposing datasets for the community, a costly and time-consuming process. Due to the aforementioned embodiment gap, the data used in robot learning research have traditionally proved rather fragmented, tailored to the specific task considered by the group of researchers who collected it, which ultimately hindered integration. The Open X-Embodiment project @oneillOpenXEmbodimentRobotic2025 was a landmark collaborative effort to address data fragmentation, curating the aggregation of 60 *existing* robotics datasets from 22 robot embodiments and 21 institutions across the world, and resulting in a total of 1.4M cross-embodiment, cross-task, openly available trajectories. Besides contributing an aggregate, large-scale dataset, @oneillOpenXEmbodimentRobotic2025 also demonstrated significant positive transfer *across tasks and embodiments*, showing that *a single model trained on multi-embodiment data can outperform specialist models* trained on their respective single-embodiment datasets. The Distributed Robot Interaction Dataset (DROID) @khazatskyDROIDLargeScaleInTheWild2025 represents another significant step towards addressing the problem of scarce and disaggregated data in robot learning, providing a unique dataset of 75k+ human demonstrations collected in realistic (*in-the-wild*) manipulation settings, another cornerstone for building general-purpose robot policies. 
Recently, foundational datasets curated through large, centralized efforts, are increasingly complemented by decentralized, community-driven contributions of robotics data. Software libraries like `lerobot` have been instrumental in enabling decentralized collection of large amounts of data, providing the infrastructure for researchers and practitioners to easily contribute trajectories from a wide range of embodiments, democratizing data access via distributed collection. Despite these advancements, the success of large, proprietary models like RT-1 and RT-2, highlighted a growing accessibility gap in robotics research, as training and deploying large-scale robotics foundation models requires computational resources simply unattainable for most research institutions. The OpenVLA project @kimOpenVLAOpenSourceVisionLanguageAction2024 emerged in direct contrast to traditionally closed-source efforts to develop VLAs. In particular, @kimOpenVLAOpenSourceVisionLanguageAction2024 trained OpenVLA by exclusively leveraging openly available data (970k+ trajectories from the Open-X dataset), and openly shared their training recipes alongside the model weights. Architecturally, OpenVLA integrates a pre-trained vision encoder to project visual tokens into the embedding space of the Llama2-7B @touvronLlama2Open2023 language-model backbone. The language model backbone is then used to predict *discrete action tokens* over 256 activation levels.
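To make the notion of discrete action tokens concrete, the sketch below uniformly bins continuous actions into 256 levels and decodes tokens back to bin centers. This is a minimal illustration of uniform discretization under assumed per-dimension bounds, not necessarily OpenVLA's exact tokenizer; the function names are hypothetical.

```python
import numpy as np

def discretize_actions(actions, low, high, n_bins=256):
    """Map continuous actions in [low, high] to integer tokens in [0, n_bins - 1]."""
    norm = (actions - low) / (high - low)            # normalize to [0, 1]
    tokens = np.floor(norm * n_bins).astype(int)     # uniform bin index
    return np.clip(tokens, 0, n_bins - 1)            # keep the upper bound in range

def detokenize_actions(tokens, low, high, n_bins=256):
    """Recover continuous actions as bin centers from integer tokens."""
    centers = (tokens + 0.5) / n_bins                # bin centers in [0, 1]
    return low + centers * (high - low)
```

Round-tripping an action through this scheme introduces at most one bin width (here, `(high - low) / 256`) of quantization error, which is the price paid for reusing a language model's discrete-token machinery.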

<span id="ch5-trends" style="position: absolute;"></span>

Robot learning is undergoing a paradigmatic shift: (A, left) centralized data collections are increasingly large, often comprising millions of demonstrations, while (A, right) decentralized data collection efforts are becoming an alternative for large-scale data collection. (B) Generalist models are also becoming increasingly smaller and easier to run on limited hardware.
Recently, compute efficiency has also become a central focus in multi-modal research. $`\pi_0`$ @blackp0VisionLanguageActionFlow2024 introduces a VLA with a MoE architecture consisting of (1) a pre-trained VLM backbone (Gemma 2.6B @teamGemma2Improving2024) and (2) a dedicated action expert used to generate continuous actions via flow matching. Images and language are embedded with PaliGemma, a VLM merging independently encoded visual and textual features deep in the network (*late fusion*), while proprioceptive state and action chunks are routed to a smaller *action expert*, initialized from scratch. The two experts communicate via self-attention layers, but maintain disjoint weights for the query, key and value matrices at each layer, preserving specialization while efficiently allocating computation.
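The disjoint-weights design can be sketched as a single attention layer in which each expert owns its Q/K/V projections while tokens attend over the shared, concatenated sequence, with VLM queries masked from the action tokens. The PyTorch sketch below is a minimal, single-head illustration under these assumptions; the class and argument names are hypothetical, not the library's API.

```python
import torch
import torch.nn as nn

class TwoExpertAttention(nn.Module):
    """Sketch of a pi0-style mixture-of-experts attention layer: disjoint
    projection weights per expert, one shared attention over all tokens."""

    def __init__(self, d_vlm, d_act, d_head):
        super().__init__()
        self.qkv_vlm = nn.Linear(d_vlm, 3 * d_head)  # VLM expert's own Q/K/V
        self.qkv_act = nn.Linear(d_act, 3 * d_head)  # action expert's own Q/K/V
        self.out_vlm = nn.Linear(d_head, d_vlm)
        self.out_act = nn.Linear(d_head, d_act)

    def forward(self, vlm_tokens, act_tokens):
        q1, k1, v1 = self.qkv_vlm(vlm_tokens).chunk(3, dim=-1)
        q2, k2, v2 = self.qkv_act(act_tokens).chunk(3, dim=-1)
        q = torch.cat([q1, q2], dim=1)               # (B, N1 + N2, d_head)
        k = torch.cat([k1, k2], dim=1)
        v = torch.cat([v1, v2], dim=1)
        n1, n = vlm_tokens.shape[1], q.shape[1]
        mask = torch.zeros(n, n)
        mask[:n1, n1:] = float("-inf")               # VLM queries cannot see action keys
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        mixed = torch.softmax(scores + mask, dim=-1) @ v
        return self.out_vlm(mixed[:, :n1]), self.out_act(mixed[:, n1:])
```

The key point is that `qkv_vlm` and `qkv_act` never share parameters, so each expert stays specialized, yet information still flows between the two token streams through the shared softmax attention (subject to the mask).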

<span id="ch5-pi0" style="position: absolute;"></span>

The π0 architecture, as in @blackp0VisionLanguageActionFlow2024. Vision and language tokens are routed to a VLM backbone, which is prevented from attending to robot proprioceptive states and action tokens; these are instead routed to a smaller subset of weights within the architecture, referred to as the "action expert". The architecture is trained with Flow Matching on 10M+ trajectories from a mixture of closed and openly available datasets.
Flow matching can be seen as a continuous-time, deterministic generalization of diffusion models.
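A single training step of this objective can be sketched as follows, assuming the common linear interpolation path between noise and data, and a model that predicts the velocity field. The `model(x_t, t, cond)` signature is a hypothetical stand-in, not the library's API.

```python
import torch

def flow_matching_loss(model, actions, cond):
    """One flow-matching training step on an action chunk (batch, horizon, dim).
    Assumes a linear path x_t = (1 - t) * noise + t * actions, whose target
    velocity is the constant (actions - noise)."""
    b = actions.shape[0]
    t = torch.rand(b, 1, 1)                  # per-sample time in [0, 1]
    noise = torch.randn_like(actions)        # x_0 ~ N(0, I)
    x_t = (1 - t) * noise + t * actions      # point on the interpolation path
    target_v = actions - noise               # velocity of the linear path
    pred_v = model(x_t, t, cond)             # model predicts the velocity field
    return ((pred_v - target_v) ** 2).mean() # regress predicted onto true velocity
```

At inference time, actions are generated by integrating the learned velocity field from pure noise at `t = 0` to `t = 1`, e.g. with a few Euler steps, rather than the long stochastic denoising chains of diffusion models.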
With VLAs at an early stage of development compared to more mature LLMs and VLMs, much of the progress made on VLAs remains proprietary, with many releases exclusively sharing the weights while withholding the data used, full experimental details, and essential methodological components of training. In contrast with this closed approach, SmolVLA @shukorSmolVLAVisionLanguageActionModel2025 is an entirely open-source research effort which aims at democratizing the development of robotics foundation models by open-sourcing the model alongside the data used, as well as the training recipes.

<span id="ch5-smolvla" style="position: absolute;"></span>

The SmolVLA architecture, as in @shukorSmolVLAVisionLanguageActionModel2025. SmolVLA is a compact MoE model trained with flow matching to denoise action chunks. Vision and language tokens are fed to a VLM backbone, and share information with the proprioceptive and action tokens via the attention mechanism. The action expert interleaves self-attention (SA) and cross-attention (CA) layers for further conditioning on the visual features from the VLM backbone. SmolVLA skips computations and reduces the number of visual tokens, resulting in 7x less memory usage than π0 (450M parameters vs. π0's 3.3B).

This tutorial has charted the paradigmatic shift transforming robotics, tracing the evolution of robotics from structured, model-based methods to the dynamic, data-driven approaches that define modern robot learning. We began by examining the limitations of traditional dynamics-based control, namely its brittleness and significant engineering overhead, which motivate the adoption of more flexible, learning-based alternatives. Unlike scalable, data-driven techniques, conventional explicit models demand extensive human expertise, hindering wider accessibility and scalability of robotics. Our exploration traced a clear trajectory of progress, beginning with Reinforcement Learning (RL). While RL offers a powerful paradigm for learning through interaction, its application in robotics is complicated by challenges such as sample inefficiency, safety concerns in real-world training, and the complexities of reward design. We saw how modern approaches like HIL-SERL make real-world RL more feasible by incorporating training-time human guidance, datasets of previously collected data, as well as learned reward classifiers.
@@ -1956,7 +1956,7 @@ for epoch in range(num_epochs): \\ -This tutorial has charted the paradigmatic shift transforming robotics, tracing the \evolution of robotics from structured, model-based methods to the dynamic, data-driven approaches that define modern robot learning\. We began by examining the limitations of traditional dynamics-based control, namely its brittleness and significant engineering overhead, which motivate the adoption of more flexible, learning-based alternatives. Unlike scalable, data-driven techniques, conventional explicit models demand extensive human expertise, hindering wider accessibility and scalability of robotics. +This tutorial has charted the paradigmatic shift transforming robotics, tracing the \evolution of robotics from structured, model-based methods to the dynamic, data-driven approaches that define modern robot learning\. We began by examining the limitations of traditional dynamics-based control, namely its brittleness and significant engineering overhead, which motivate the adoption of more flexible, learning-based alternatives. Unlike scalable, data-driven techniques, conventional explicit models demand extensive human expertise, hindering wider accessibility and scalability of robotics. Our exploration traced a clear trajectory of progress, beginning with Reinforcement Learning (RL). While RL offers a powerful paradigm for learning through interaction, its application in robotics is complicated by challenges such as sample inefficiency, safety concerns in real-world training, and the complexities of reward design. We saw how modern approaches like HIL-SERL make real-world RL more feasible by incorporating training-time human guidance, datasets of previously collected data as well as learned reward classifiers.