# Habitat 3.0 Social Rearrangement: Independent Baseline (No Communication)
Trained weights for the independent-policy baseline on the Social Rearrangement task from Habitat 3.0. Two embodied agents, a Boston Dynamics Spot robot and a humanoid, cooperate to rearrange objects across 37 HSSD scenes.
Agents can only observe each other's relative GPS position. They share no explicit messages, no shared context, and no coordination protocol: just two independent PPO policies learning not to get in each other's way.
For the learned communication variant that achieves 2.3x the task success, see `hab3-social-rearrange-fabric`.
This work is part of the thesis "Scalable Multi-Agent Coordination Using a Shared-Context Architecture for Embodied Robotics" by Benjamin Kubwimana.
## What's in this repo
| File | Description |
|---|---|
| `model.pth` | Final checkpoint after 100M frames (65 MB) |
| `training_curve.log` | Raw training log with per-update metrics |
The checkpoint contains the full model state dict for both agents.
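The exact key layout depends on the Habitat-Baselines version. As a hedged sketch, assuming parameters are namespaced with per-agent prefixes such as `agent_0.` / `agent_1.` (an assumed naming scheme, not verified against this checkpoint), the combined state dict could be split per agent like this:

```python
# Sketch: split a combined multi-agent state dict into per-agent dicts.
# The "agent_0." / "agent_1." prefixes are an ASSUMED naming scheme,
# not verified against this checkpoint.

def split_by_agent(state_dict):
    """Partition parameters by the prefix before the first dot."""
    per_agent = {}
    for key, value in state_dict.items():
        prefix, _, rest = key.partition(".")
        per_agent.setdefault(prefix, {})[rest] = value
    return per_agent

# Hypothetical keys for illustration only:
fake_sd = {
    "agent_0.visual_encoder.conv1.weight": "w0",
    "agent_0.lstm.weight_ih_l0": "w1",
    "agent_1.visual_encoder.conv1.weight": "w2",
}
split = split_by_agent(fake_sd)
print(sorted(split))          # ['agent_0', 'agent_1']
print(len(split["agent_0"]))  # 2
```

With a real checkpoint one would first do `torch.load("model.pth", map_location="cpu")` and locate the state dict inside the loaded object.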
## Task overview
Each episode drops the two agents into an HSSD home scene with a set of objects that need to be moved to goal locations. The task is structured as a PDDL planning problem with four subgoal stages:
- Stage 1.1: Agent 0 (Spot) picks up its target object
- Stage 1.2: Agent 0 places the object at the goal
- Stage 2.1: Agent 1 (Humanoid) picks up its target object
- Stage 2.2: Agent 1 places the object at the goal
Full success (`pddl_success`) requires both agents to complete all of their subgoals within 750 timesteps.
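As a minimal sketch of this success criterion (the real check lives in Habitat's PDDL task logic; stage names here are illustrative), full success is simply "all four stages done within the horizon":

```python
# Illustrative version of the success criterion described above; not the
# actual Habitat PDDL evaluator.

MAX_STEPS = 750  # episode horizon from the task definition

def pddl_success(stages: dict, num_steps: int) -> bool:
    """All four subgoal stages completed within the episode horizon."""
    required = ("stage_1_1", "stage_1_2", "stage_2_1", "stage_2_2")
    return num_steps <= MAX_STEPS and all(stages.get(s, False) for s in required)

print(pddl_success({"stage_1_1": True, "stage_1_2": True,
                    "stage_2_1": True, "stage_2_2": True}, 600))  # True
print(pddl_success({"stage_1_1": True, "stage_1_2": False,
                    "stage_2_1": True, "stage_2_2": True}, 600))  # False
```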
## Architecture
Both agents use a hierarchical RL policy:
- High-level: Neural network (ResNet18 visual encoder → 2-layer LSTM) that selects which skill to execute
- Low-level: Oracle navigation + learned manipulation skills (pick, place, nav_to_obj, etc.)
The high-level policy is the only trained component. Low-level skills rely on privileged oracle information (perfect pathfinding, etc.).
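The hierarchical control flow can be sketched as follows. This is a stand-in skeleton, not the Habitat-Baselines API: the function names, the skill list, and the deterministic "policy" are all illustrative.

```python
# Hedged sketch of the hierarchical loop described above: a trained
# high-level policy picks a skill, and fixed low-level skills (oracle nav,
# learned pick/place) execute it. All names are illustrative.

SKILLS = ["nav_to_obj", "pick", "nav_to_goal", "place", "wait"]

def high_level_policy(obs, hidden):
    """Stand-in for the ResNet18 -> LSTM skill selector (the trained part).
    A real policy returns skill logits; here we just cycle deterministically."""
    idx = hidden % len(SKILLS)
    return SKILLS[idx], hidden + 1

def run_episode(num_decisions=5):
    hidden, trace = 0, []
    for _ in range(num_decisions):
        skill, hidden = high_level_policy(None, hidden)
        trace.append(skill)  # a low-level skill would run here until it terminates
    return trace

print(run_episode())  # ['nav_to_obj', 'pick', 'nav_to_goal', 'place', 'wait']
```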
Observations per agent:
- Depth camera image
- Binary `is_holding` flag
- GPS+compass to object start/goal positions
- Relative GPS to the other agent
Trainer: DD-PPO (Decentralized Distributed PPO) across 7 GPUs, 18 environments per GPU (126 parallel environments total).
## Training details
| Parameter | Value |
|---|---|
| Total frames | 100M |
| Updates | 6,277 |
| Batch size | 128 steps × 126 envs |
| Learning rate | 2.5e-4 |
| PPO epochs | 1 |
| Mini-batches | 2 |
| Clip param | 0.2 |
| Discount (γ) | 0.99 |
| GAE (λ) | 0.95 |
| Entropy coef | 0.0001 |
| Max grad norm | 0.2 |
| Backbone | ResNet18 |
| RNN | 2-layer LSTM |
| Trainable params | ~8.4M per agent |
| Wall time | ~40 hours (including restarts) |
| Throughput | ~697 fps |
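The rollout numbers above are internally consistent, as a quick check shows:

```python
# Consistency check on the hyperparameters in the table above.
steps_per_rollout = 128
num_envs = 7 * 18          # 7 GPUs x 18 environments each
frames_per_update = steps_per_rollout * num_envs

print(num_envs)            # 126
print(frames_per_update)   # 16128
# 100M frames / 16128 frames-per-update ~= 6200 updates, close to the
# reported 6,277 (the small gap is plausibly due to restarts/rounding).
print(round(100_000_000 / frames_per_update))  # 6200
```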
## Final metrics
| Metric | Value |
|---|---|
| Reward | 15.5 |
| Full task success | 27.7% |
| Stage 1.1 (Spot picks) | 85.6% |
| Stage 1.2 (Spot places) | 46.4% |
| Stage 2.1 (Human picks) | 81.6% |
| Stage 2.2 (Human places) | 49.4% |
| Collision rate | 29.5% |
| Cooperation reward | +1.04 |
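A back-of-envelope reading of these numbers: full success requires both "place" stages (and each place implies the corresponding pick). If the two agents' outcomes were independent, the joint rate would be roughly the product of the two place rates, which comes out below the observed 27.7%. This is a rough sanity check on the reported table, not a claim from the thesis.

```python
# If the two agents' place stages were independent, the joint success rate
# would be the product of the per-agent place rates from the table above.
spot_places = 0.464
human_places = 0.494
independent_joint = spot_places * human_places
print(f"{independent_joint:.1%}")  # 22.9%
# Observed full success is 27.7%, above this independence estimate,
# consistent with episodes being jointly easy or hard for both agents.
```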
## Training curve
| Frames | Reward | Task success | Collisions | Cooperation |
|---|---|---|---|---|
| 1.6M (upd 100) | 1.8 | 0.0% | 91% | -0.42 |
| 8M (upd 500) | 5.2 | 1.2% | 68% | -0.28 |
| 16M (upd 1000) | 7.1 | 3.4% | 58% | -0.20 |
| 48M (upd 3000) | 10.5 | 11% | 47% | +0.26 |
| 96M (upd 6000) | 15.0 | 26% | 30% | +0.99 |
| 100M (upd 6277) | 15.5 | 27.7% | 29.5% | +1.04 |
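The sign flip can also be bracketed from the table itself. Linear interpolation between the 16M and 48M rows puts the zero crossing near 30M frames; this is a coarse estimate from these widely spaced samples, and the per-update log in `training_curve.log` would give a tighter number.

```python
# Estimate where the cooperation reward crosses zero, using the two
# table rows that bracket the sign flip (coarse samples only).
f0, c0 = 16e6, -0.20   # 16M-frames row
f1, c1 = 48e6, +0.26   # 48M-frames row

crossing = f0 + (f1 - f0) * (-c0) / (c1 - c0)
print(f"{crossing / 1e6:.1f}M frames")  # 29.9M frames
```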
The cooperation reward flips from negative to positive around 25M frames; that is roughly when the agents stop bumping into each other and start implicitly coordinating.
## How to evaluate
Requires Habitat 3.0 (v0.3.3) with `habitat-baselines` and `habitat-sim` installed.
```bash
python -u -m habitat_baselines.run \
  --config-name=social_rearrange/pop_play.yaml \
  habitat_baselines.evaluate=True \
  habitat_baselines.eval.should_load_ckpt=True \
  habitat_baselines.eval_ckpt_path_dir=model.pth \
  habitat_baselines.test_episode_count=50 \
  habitat_baselines.num_environments=1 \
  habitat.dataset.data_path=data/datasets/hab3_episodes/val/social_rearrange.json.gz \
  habitat.dataset.scenes_dir=data/scene_datasets \
  'habitat_baselines.eval.video_option=["disk"]'
```
## Known quirks
The official `hab3_episodes` dataset has some inconsistencies:
- ~39% of episodes reference objects in `name_to_receptacle` that aren't in the episode's `rigid_objs` list
- Some episode spawn positions don't land on the navmesh

We patched `kinematic_relationship_manager.py` and `rearrange_sim.py` to handle these gracefully (skip missing objects, fall back to random navigable points). Without these patches, training crashes intermittently on certain episodes.
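The actual patches live in the Habitat source files named above; as a hedged illustration of the two workarounds (function and field names here are hypothetical, not the real patch):

```python
# Illustrative sketch of the two workarounds described above.
# Function and field names are hypothetical, not the actual patch.

def filter_receptacle_map(name_to_receptacle, rigid_objs):
    """Drop receptacle entries whose object is missing from rigid_objs."""
    known = set(rigid_objs)
    return {obj: rec for obj, rec in name_to_receptacle.items() if obj in known}

def safe_spawn(position, is_navigable, sample_navigable_point):
    """Fall back to a random navigable point if the spawn is off-navmesh."""
    return position if is_navigable(position) else sample_navigable_point()

# Example: one dangling receptacle reference gets skipped.
mapping = {"cup_1": "table_1", "ghost_obj": "shelf_2"}
print(filter_receptacle_map(mapping, ["cup_1"]))  # {'cup_1': 'table_1'}
```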
## Citation
```bibtex
@mastersthesis{kubwimana2026scalable,
  title  = {Scalable Multi-Agent Coordination Using a Shared-Context Architecture for Embodied Robotics},
  author = {Kubwimana, Benjamin},
  year   = {2026},
  school = {Georgia Institute of Technology},
  note   = {Baseline model: \url{https://huggingface.co/edge-inference/hab3-social-rearrange-baseline}}
}
```
Built on the Habitat 3.0 platform:
```bibtex
@inproceedings{puig2023habitat3,
  title     = {Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots},
  author    = {Puig, Xavier and Undersander, Eric and Szot, Andrew and Cote, Mikael Dallaire and Batra, Dhruv and Berges, Vincent-Pierre and others},
  booktitle = {ICLR},
  year      = {2024}
}
```
## License
MIT. The underlying Habitat platform and HSSD scenes have their own licenses; see the Habitat 3.0 repo for details.