# Habitat 3.0 Social Rearrangement: Fabric (Learned Communication)

Trained weights for the Pereason + Go + Fabric multi-agent coordination model on the Social Rearrangement task from Habitat 3.0. Two embodied agents, a Boston Dynamics Spot robot and a humanoid, cooperate to rearrange objects across 37 HSSD scenes.
This model uses Fabric, a learned cross-attention communication module that lets agents exchange compressed multi-modal feature messages each timestep. Fabric fuses semantic (VLM) and geometric (depth) tokens before encoding them into inter-agent messages, enabling richer coordination. Compared to the no-communication Go baseline (69.6% task success), Fabric reaches 76.8% task success with significantly lower collision rates.
This work is part of the thesis "Scalable Multi-Agent Coordination Using a Shared-Context Architecture for Embodied Robotics" by Benjamin Kubwimana. Fabric is the shared-context communication layer explored in the thesis, evaluated here on the Habitat 3.0 social rearrangement benchmark.
## What's in this repo
| File | Description |
|---|---|
| `model-latest.pth` | Peak unified Fabric checkpoint (ckpt.250, ~64M frames, 76.8% success) |
| `model-semantic-fabric.pth` | Semantic-only Fabric baseline (ckpt.69, ~57M frames, 63.9% success) |
Each checkpoint contains the full model state dict for both agents (keys 0, 1) plus training config.
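A minimal sketch of how one might split such a checkpoint into its per-agent parts. The top-level key names (`"state_dict"`, `"config"`) are assumptions for illustration; verify against the real file (e.g. with `torch.load("model-latest.pth", map_location="cpu")`) before relying on them.

```python
# Hypothetical checkpoint layout: per-agent state dicts under keys 0 and 1,
# plus the training config, as described above. Key names are illustrative.
def split_agents(ckpt):
    """Return (spot_state, humanoid_state, config) from a loaded checkpoint."""
    return ckpt["state_dict"][0], ckpt["state_dict"][1], ckpt.get("config")

# In-memory stand-in with the documented shape, for demonstration only:
fake_ckpt = {
    "state_dict": {0: {"go.w": [0.0]}, 1: {"go.w": [1.0]}},
    "config": {"trainer": "DD-PPO"},
}
spot, humanoid, cfg = split_agents(fake_ckpt)
```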
## Task overview
Each episode drops two agents into an HSSD home scene with objects that need to be moved to goal locations. The task is structured as a PDDL planning problem with four subgoal stages:
- Stage 1.1: Agent 0 (Spot) picks up its target object
- Stage 1.2: Agent 0 places the object at the goal
- Stage 2.1: Agent 1 (Humanoid) picks up its target object
- Stage 2.2: Agent 1 places the object at the goal
Full success (`pddl_success`) requires both agents to complete all subgoals within 750 timesteps.
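The full-success criterion can be sketched as a simple conjunction over the four stages. Stage names and the dict shape here are illustrative, not the simulator's actual API.

```python
# pddl_success is true only when all four subgoal stages complete within the
# 750-step episode budget (stage names are illustrative placeholders).
MAX_STEPS = 750
STAGES = ("stage_1_1", "stage_1_2", "stage_2_1", "stage_2_2")

def pddl_success(stage_done_step):
    """stage_done_step: stage name -> timestep it completed, or None."""
    return all(
        stage_done_step.get(s) is not None and stage_done_step[s] <= MAX_STEPS
        for s in STAGES
    )
```

For example, `pddl_success({"stage_1_1": 80, "stage_1_2": 210, "stage_2_1": 95, "stage_2_2": 640})` is `True`, while any missing or late stage makes it `False`.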
## Architecture
Both agents use a hierarchical RL policy with three main components:
### Pereason (Perception + Reasoning)

- VLM backbone: SmolVLM2-500M-Video-Instruct (frozen, 303M params); processes RGB via a ViT encoder and generates semantic tokens through a truncated language decoder (16 of 32 layers)
- Depth encoder: Depth Anything V2 Small (24.8M params, 22.1M trainable); encodes depth images into geometric tokens
- PDDL task instructions are tokenized and fed to the VLM alongside the visual input
### Go (Skill Selector)

- 8-block transformer that attends over semantic + geometric tokens via cross-attention, then selects a high-level skill
- Outputs a categorical distribution over available actions (`nav_to_obj`, `pick`, `place`, etc.)
- 8 learnable query tokens, 128 context tokens
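A minimal numpy sketch of one cross-attention read in this style of skill selector: 8 learnable queries attend over 128 fused context tokens, and the result is mapped to a categorical distribution over skills. The weight shapes, the 960-dim feature width, the single attention head, and the skill list are illustrative assumptions; the real Go stacks 8 transformer blocks with multi-head attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 960            # fused token width (assumed: VLM feature dimension)
n_query, n_ctx = 8, 128
skills = ["nav_to_obj", "pick", "place", "nav_to_goal", "wait"]  # illustrative subset

queries = rng.normal(size=(n_query, d)) * 0.02   # learnable query tokens
context = rng.normal(size=(n_ctx, d))            # semantic + geometric tokens

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Single-head cross-attention: queries read from the fused context tokens.
attn = softmax(queries @ context.T / np.sqrt(d))   # (8, 128), rows sum to 1
read = attn @ context                              # (8, 960)

# Pool the query outputs and map to a categorical distribution over skills.
w_out = rng.normal(size=(d, len(skills))) * 0.02
logits = read.mean(axis=0) @ w_out
probs = softmax(logits)                            # distribution over 5 skills
```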
### Fabric (Communication)
- Geometric projection + pooling: a learned linear layer projects depth tokens (384-dim) to the VLM feature dimension (960-dim), then chunk-based average pooling compresses ~1370 spatial tokens down to 128
- Token fusion: pooled geometric tokens are concatenated with semantic tokens into a single fused sequence per agent
- Per-agent `MessageEncoder` (Linear → GELU → Linear → LayerNorm) compresses the fused multi-modal features into 128-dim messages
- Per-agent `MessageDecoder` (cross-attention with gated residual) decodes the partner's message back into the agent's fused feature space; the semantic portion is then extracted for downstream use
- Gate bias initialized at -2.0, so communication starts near-zero and ramps up as training progresses
- Symmetric t-1 communication: both agents read from the previous timestep's bus, eliminating the asymmetry where `agent_1` sees current features but `agent_0` only sees stale ones
- Original full-resolution geometric tokens bypass Fabric and go directly to Go alongside the enriched semantics
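The message path above can be sketched in numpy using the dimensions stated in the text (384→960 projection, ~1370→128 pooled tokens, 128-dim messages, gate bias -2.0). All weight names, the semantic token count, and the mean-pooling before the encoder are assumptions for illustration, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Geometric projection: depth tokens (384-dim) -> VLM width (960-dim).
geo = rng.normal(size=(1370, 384))            # raw depth tokens
w_proj = rng.normal(size=(384, 960)) * 0.02   # learned linear projection
geo_960 = geo @ w_proj                        # (1370, 960)

# Chunk-based average pooling: split the token axis into 128 chunks, mean each.
chunks = np.array_split(geo_960, 128, axis=0)
geo_pooled = np.stack([c.mean(axis=0) for c in chunks])   # (128, 960)

# Token fusion: concatenate semantic and pooled geometric tokens.
sem = rng.normal(size=(64, 960))                          # semantic (VLM) tokens; count assumed
fused = np.concatenate([sem, geo_pooled], axis=0)         # (192, 960)

# MessageEncoder sketch: Linear -> GELU -> Linear -> LayerNorm to msg_dim=128.
def gelu(x):
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

w1 = rng.normal(size=(960, 256)) * 0.02
w2 = rng.normal(size=(256, 128)) * 0.02
h = gelu(fused.mean(axis=0) @ w1) @ w2        # mean-pooling to one message is an assumption
msg = (h - h.mean()) / (h.std() + 1e-5)       # LayerNorm without affine params

# Gated residual on the decode side: gate bias -2.0 keeps early communication weak.
gate_bias = -2.0
gate = 1 / (1 + np.exp(-gate_bias))           # sigmoid(-2) ~= 0.12 at init
decoded = rng.normal(size=fused.shape)        # stand-in for the cross-attention output
enriched = fused + gate * decoded             # partner's message blended in softly
```

The near-zero gate at initialization means the policy first learns to act without relying on the partner's message, then gradually opens the channel as the gate parameters are trained.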
### Trainable parameters (per agent)
| Module | Total params | Trainable params |
|---|---|---|
| SmolVLM2 (VLM) | 303M | 0 (frozen) |
| Depth Anything V2 | 24.8M | 22.1M |
| Go (transformer) | 73.5M | 73.5M |
| Fabric + geo_proj | 6.7M | 6.7M |
| Total | 408.0M | 102.3M |
## Low-level skills

Oracle navigation plus learned manipulation skills (`pick`, `place`, `nav_to_obj`, etc.) using privileged simulator information.
Observations per agent:
- RGB camera image (arm cam for Spot, head cam for Humanoid)
- Depth camera image
- Binary `is_holding` flag
- GPS+compass to object start/goal positions
- Relative GPS to the other agent
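The observation list above could be laid out as a per-agent dict of sensor shapes. Key names and image resolutions here are assumptions for the sketch, not the exact sensor names used by habitat-baselines.

```python
# Illustrative per-agent observation layout (names and shapes are assumed).
obs_shapes = {
    "rgb": (224, 224, 3),            # arm cam (Spot) or head cam (Humanoid)
    "depth": (224, 224, 1),
    "is_holding": (1,),              # binary flag
    "obj_start_gps_compass": (2,),   # GPS+compass to object start
    "obj_goal_gps_compass": (2,),    # GPS+compass to goal
    "other_agent_gps": (2,),         # relative GPS to the partner
}
```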
## Training details
| Parameter | Value |
|---|---|
| Total frames | ~64M |
| Batch size | 128 steps x 64 envs |
| Learning rate | 1e-4 (Go), 1e-5 (Fabric, geo_proj, Depth encoder) |
| PPO epochs | 2 |
| Mini-batches | 2 |
| Clip param | 0.2 |
| Discount (gamma) | 0.99 |
| GAE (tau) | 0.95 |
| Entropy coef | 0.001 |
| Max grad norm | 0.5 |
| Trainer | DD-PPO |
| Fabric msg_dim | 128 |
| Fabric num_heads | 4 |
| geo_proj pooled tokens | 128 |
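For convenience when reproducing the run, the table above can be collected as a plain dict. The key names here are descriptive, not the exact Hydra/habitat-baselines config keys.

```python
# Hyperparameters from the training-details table (names are illustrative).
ppo_cfg = {
    "total_frames": 64_000_000,
    "rollout_steps": 128,
    "num_envs": 64,
    "lr_go": 1e-4,
    "lr_fabric": 1e-5,      # also used for geo_proj and the depth encoder
    "ppo_epochs": 2,
    "num_mini_batch": 2,
    "clip_param": 0.2,
    "gamma": 0.99,
    "gae_tau": 0.95,
    "entropy_coef": 0.001,
    "max_grad_norm": 0.5,
    "fabric_msg_dim": 128,
    "fabric_num_heads": 4,
    "geo_pooled_tokens": 128,
}
```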
## Results
| Metric | Value |
|---|---|
| Task success | 76.8% |
| Reward | 25.5 |
| Stage 1.1 (Spot picks) | 95.8% |
| Stage 1.2 (Spot places) | 87.2% |
| Stage 2.1 (Human picks) | 95.2% |
| Stage 2.2 (Human places) | 83.0% |
| Collision rate | 10.7% |
| Cooperation reward | +2.79 |
### Comparison with no-communication baseline

| Metric | Go (no comm) | Fabric | Delta |
|---|---|---|---|
| Task success | 69.6% | 76.8% | +7.2pp |
| Reward | 23.6 | 25.5 | +1.9 |
| Collision rate | 15.4% | 10.7% | -4.7pp |
| Cooperation reward | +2.77 | +2.79 | +0.02 |
### Training curve
| Frames | Reward | Task success | Collisions | Cooperation |
|---|---|---|---|---|
| 0M | 1.0 | 0.0% | 100.0% | -0.50 |
| 10M | 9.2 | 5.5% | 25.7% | +0.11 |
| 25M | 12.3 | 13.6% | 19.3% | +0.44 |
| 45M | 24.1 | 72.2% | 12.5% | +2.66 |
| 50M | 24.5 | 73.0% | 11.6% | +2.72 |
| 64M | 25.5 | 76.8% | 10.7% | +2.79 |
## How to evaluate

Requires Habitat 3.0 (v0.3.3) with habitat-baselines, habitat-sim, and the Pereason+Go+Fabric policy classes.

```bash
python -u -m habitat_baselines.run \
  --config-name=social_rearrange/pereason_go_fabric \
  habitat_baselines.evaluate=True \
  habitat_baselines.eval.should_load_ckpt=True \
  habitat_baselines.eval_ckpt_path_dir=model-latest.pth \
  habitat_baselines.test_episode_count=50 \
  habitat_baselines.num_environments=1 \
  habitat.dataset.data_path=data/datasets/hab3_episodes/val/social_rearrange.json.gz \
  habitat.dataset.scenes_dir=data/scene_datasets \
  'habitat_baselines.eval.video_option=["disk"]'
```
## Dependencies

- SmolVLM2-500M-Video-Instruct (downloaded automatically)
- Depth-Anything-V2-Small-hf (downloaded automatically)
- Habitat 3.0 with HSSD scenes and the `hab3_episodes` dataset
## Citation

```bibtex
@mastersthesis{kubwimana2026scalable,
  title  = {Scalable Multi-Agent Coordination Using a Shared-Context Architecture for Embodied Robotics},
  author = {Kubwimana, Benjamin},
  year   = {2026},
  school = {Georgia Institute of Technology},
  note   = {Model weights: \url{https://huggingface.co/edge-inference/hab3-social-rearrange-fabric}}
}
```
Built on the Habitat 3.0 platform:
```bibtex
@inproceedings{puig2023habitat3,
  title     = {Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots},
  author    = {Puig, Xavier and Undersander, Eric and Szot, Andrew and Cote, Mikael Dallaire and Batra, Dhruv and Berges, Vincent-Pierre and others},
  booktitle = {ICLR},
  year      = {2024}
}
```
## License

MIT. The underlying Habitat platform and HSSD scenes have their own licenses; see the Habitat 3.0 repo for details.
