	---
	license: mit
	tags:
	- reinforcement-learning
	- minihack
	- diffusion
	- planning
	- behavior-cloning
	---

	# ReMDM-MiniHack

	A generative planning agent for MiniHack navigation built on Re-Masked Discrete Diffusion (ReMDM).

	The agent uses masked discrete diffusion to iteratively generate action sequences for dungeon navigation.
	Instead of predicting the next action autoregressively, the model generates an entire 64-step trajectory
	at once, starting from masked action tokens and progressively unmasking them.
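
To make the unmask-and-remask loop concrete, here is a minimal sketch in plain Python. It is not the repository's sampler: `dummy_denoiser`, the linear unmasking schedule, the `remask_frac` parameter, and the `MASK` token id are all assumptions made for illustration, and the random "denoiser" stands in for the trained network. Only the 64-step horizon and the 12 actions come from this README.

```python
import random

MASK = -1          # hypothetical mask token id (assumption)
HORIZON = 64       # trajectory length stated in this README
NUM_ACTIONS = 12   # matches action_dim=12 in the Quick Start

def dummy_denoiser(tokens, rng):
    """Stand-in for the trained model: one (action, confidence) pair per position."""
    return [(rng.randrange(NUM_ACTIONS), rng.random()) for _ in tokens]

def sample_plan(num_steps=8, remask_frac=0.1, seed=0):
    """Generate a full action sequence by iterative unmasking with remasking."""
    rng = random.Random(seed)
    tokens = [MASK] * HORIZON  # start from a fully masked trajectory
    for step in range(1, num_steps + 1):
        preds = dummy_denoiser(tokens, rng)
        # Linear schedule: how many positions should be unmasked after this step.
        target = HORIZON * step // num_steps
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        n_reveal = max(target - (HORIZON - len(masked)), 0)
        # Commit the most confident predictions among the masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_reveal]:
            tokens[i] = preds[i][0]
        # Remasking: on every step but the last, re-mask a small random share
        # of committed tokens so later passes can revise them.
        if step < num_steps:
            committed = [i for i, t in enumerate(tokens) if t != MASK]
            for i in rng.sample(committed, int(remask_frac * len(committed))):
                tokens[i] = MASK
    return tokens

plan = sample_plan()
print(len(plan), MASK in plan)
```

The final step skips remasking and reveals every remaining masked position, so the returned plan is always a complete sequence of action ids.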

	## Code

	GitHub: [piotrwilam/ReMDM-MiniHack-Project](https://github.com/piotrwilam/ReMDM-MiniHack-Project)

	## Models

	| Version | Model | Params | Training | Tag |
	|---|---|---|---|---|
	| v017_local_baseline | LocalDiffusionPlanner | 7M | Offline BC, 200 demos/env, 30 epochs | none |
	| v017_local_baseline | LocalDiffusionPlanner | 7M | Offline BC, 500 demos/env, 60 epochs | `v0.17-local-baseline-gold` (pending) |

	## Repo Structure

	```
	ReMDM-MiniHack/
	├── README.md                      # This file
	├── v017_local_baseline/
	│   ├── inference_weights.pth      # EMA state dict (for evaluation)
	│   ├── full_checkpoint.pth        # Full training state (for resuming)
	│   ├── config.json                # Hyperparams + model args
	│   └── eval_results.csv           # Per-environment results
	└── datasets/
	    └── oracle_demos_v017.pt       # Oracle demonstration dataset
	```

	## Quick Start

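	```python
	import torch
	from huggingface_hub import hf_hub_download

	# `model` is assumed to be the module from the GitHub repository linked above;
	# it must be importable, e.g. by running from a checkout of that repository.
	from model import LocalDiffusionPlanner

	# Download the EMA inference weights from the Hub
	path = hf_hub_download(
	    repo_id="piotrwilam/ReMDM-MiniHack",
	    filename="v017_local_baseline/inference_weights.pth",
	)
	# weights_only=False unpickles arbitrary Python objects; use it only with checkpoints you trust
	weights = torch.load(path, map_location="cpu", weights_only=False)

	# Build the model and load the weights
	model = LocalDiffusionPlanner(action_dim=12)
	model.load_state_dict(weights)
	model.eval()  # switch to evaluation mode (disables dropout and similar training-only behavior)
	```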

	## Results: v017 Local Baseline (Offline BC, 200 demos/env, 30 epochs)

	| Environment | Win% | Avg Steps |
	|---|---|---|
	| Room-Random-5x5 | 94% | 18.3 |
	| Room-Random-15x15 | 54% | 130.4 |
	| Room-Dark-5x5 | 90% | 25.5 |
	| Room-Ultimate-5x5 | 84% | 20.8 |
	| Room-Ultimate-15x15 | 30% | 72.1 |
	| Corridor-R2 | 42% | 132.1 |
	| Corridor-R3 | 0% | 200.0 |
	| MazeWalk-9x9 | 48% | 119.0 |
	| MazeWalk-15x15 | 22% | 162.3 |
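
For a quick aggregate view of the table above, the sketch below copies its numbers into a dictionary and computes unweighted means across the nine environments, plus a ranking by win rate. The `summarize` helper is hypothetical and not part of the repository, and the means treat every environment equally because the number of evaluation episodes per environment is not stated here.

```python
# (win rate in %, average steps) per environment, copied from the results table
results = {
    "Room-Random-5x5": (94, 18.3),
    "Room-Random-15x15": (54, 130.4),
    "Room-Dark-5x5": (90, 25.5),
    "Room-Ultimate-5x5": (84, 20.8),
    "Room-Ultimate-15x15": (30, 72.1),
    "Corridor-R2": (42, 132.1),
    "Corridor-R3": (0, 200.0),
    "MazeWalk-9x9": (48, 119.0),
    "MazeWalk-15x15": (22, 162.3),
}

def summarize(results):
    """Return unweighted mean win rate, mean steps, and environments sorted by win rate."""
    n = len(results)
    mean_win = sum(win for win, _ in results.values()) / n
    mean_steps = sum(steps for _, steps in results.values()) / n
    ranked = sorted(results, key=lambda env: results[env][0])  # lowest win rate first
    return mean_win, mean_steps, ranked

mean_win, mean_steps, ranked = summarize(results)
print(f"mean win rate: {mean_win:.1f}%")
print(f"mean steps per episode: {mean_steps:.1f}")
print("lowest win rates:", ranked[:3])
```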