---
title: LandscapeForge
emoji: 🏔️
colorFrom: purple
colorTo: blue
sdk: docker
pinned: true
app_port: 8000
tags:
- openenv
- reinforcement-learning
- optimization
- llm-agents
- self-improvement
- gradio
license: apache-2.0
short_description: LLM agent designs optimizers via a probe-draft-commit REPL.
---
# 🏔️ LandscapeForge
**An OpenEnv where an LLM agent designs optimization algorithms through a probe-draft-commit REPL, trained against a Goldilocks-regulated landscape adversary.**
Target: **OpenEnv Hackathon, April 2026**. Primary Theme 4 (Self-Improvement), secondary Theme 1 (Multi-Agent).
---
## What this Space gives you
Two things, in one container:
| Path | What it is |
|---|---|
| **`/web`** | **Interactive Gradio demo** – landscape explorer + baseline race + paste-your-own-optimizer arena. Visual-first, meant to make the env legible to judges. |
| **`/reset`, `/step`, `/schema`, WebSocket** | **OpenEnv FastAPI endpoints** – wire the env into a TRL / Unsloth GRPO training loop; see the sketch below. |
Same process, no second container required.
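For orientation, here is a minimal sketch of calling the FastAPI endpoints directly with `requests`. The `/step` payload shape is an assumption for illustration; `/schema` is the authoritative source for the real field names:

```python
import requests

BASE = "http://localhost:8000"  # app_port from the front matter above

schema = requests.get(f"{BASE}/schema").json()   # action/observation schema
obs = requests.post(f"{BASE}/reset").json()      # start an episode

# Hypothetical action payload -- verify field names against /schema
step = requests.post(
    f"{BASE}/step",
    json={"kind": "run_baseline", "baseline_name": "adam"},
).json()
```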
---
## How the env works (in 90 seconds)
**OptCoder** is the LLM policy. Each episode:
1. **LandscapeForge** (an internal template picker in v1) chooses a loss landscape `f: ℝⁿ → ℝ` at a tier-appropriate difficulty, drawn from nine template families including quadratic / Rosenbrock / Styblinski–Tang / Gaussian-mix / Himmelblau / plateau / cliff.
2. **OptCoder runs a 4-action REPL** with a budget of 12 units:
   - `run_baseline(name)` – run one of SGD / Momentum / Adam / L-BFGS on the hidden landscape and see its trajectory (cost: 2)
   - `draft(code)` – submit a full `Optimizer` class; the env auto-tests it for 20 steps (cost: 2)
   - `inspect(draft_idx, step_range)` – zoom into a prior draft's per-step `(x, f, grad, update_norm, step_size_eff)` to diagnose failures (cost: 1)
   - `commit` – evaluate the latest draft on the full **Phase-D arena**: 10 fresh seeds × 200 steps (cost: 0)
3. **Reward** (terminal only; stepwise signals are feedback-only):
   - `r_regret` – **Adam-relative progress** (tuned Adam LR per landscape; no `f_min` dependency, so it generalises directly to NN training); see the sketch after this list
   - `r_convergence`, `r_robustness`, `r_novelty` (gated), minus `r_budget`, minus `r_eval_failures`
4. **GRPO** can then train the policy; arena wall-clock is ~50 ms, so roughly 36k episodes/hour on one H100.
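For intuition, here is a hedged sketch of the Adam-relative regret term with hypothetical names (the real definition lives in `rewards.py`):

```python
def r_regret(f0: float, f_mine: float, f_adam: float, eps: float = 1e-12) -> float:
    """Adam-relative progress: ~1.0 means the draft matched tuned Adam, >1.0 beats it.

    f0     -- loss at the shared start point
    f_mine -- best loss reached by the committed optimizer
    f_adam -- best loss reached by tuned Adam on the same landscape + seed
    Hypothetical form; note that no f_min (global optimum) appears anywhere.
    """
    my_progress = f0 - f_mine
    adam_progress = max(f0 - f_adam, eps)  # guard against a flat Adam run
    return my_progress / adam_progress
```

Because the ratio only needs Adam's trajectory, not the landscape's global minimum, the same reward transfers unchanged to neural-network training runs.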
See `IMPLEMENTATION.md` and `LANDSCAPEFORGE_DESIGN.md` in this repo for the full spec, the staged bootstrap (SFT → solo RL → adversarial unfreezing), and the anti-reward-hacking table.
---
## Quick-start (Python client)
```python
from landscapeforge import LandscapeforgeEnv, LandscapeforgeAction

with LandscapeforgeEnv.from_docker_image("landscapeforge-env:latest") as env:
    obs = env.reset()

    # Probe: run tuned Adam on the hidden landscape and observe its trajectory (cost: 2)
    env.step(LandscapeforgeAction(kind="run_baseline", baseline_name="adam"))

    # Draft: submit a full Optimizer class; the sandbox exposes `np` to draft code (cost: 2)
    env.step(LandscapeforgeAction(kind="draft", code="""
class Optimizer:
    def __init__(self, dim):
        self.lr = 0.05; self.beta = 0.9
        self.v = np.zeros(dim)
    def step(self, x, f_val, grad):
        self.v = self.beta * self.v - self.lr * grad
        return x + self.v
"""))

    # Commit: score the latest draft on the Phase-D arena (cost: 0)
    result = env.step(LandscapeforgeAction(kind="commit"))
    print(result.observation.r_optcoder_breakdown)
    # {'r_regret': ..., 'r_convergence': ..., 'r_robustness': ...,
    #  'r_novelty': ..., 'r_budget': ..., 'r_eval_failures': ...,
    #  'my_progress': ..., 'adam_progress': ..., 'speedup_vs_adam': ...}
```
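If the arena score disappoints, the `inspect` action from the REPL above pulls per-step diagnostics for an earlier draft before you re-draft. A sketch, with argument names taken from the action list; zero-based draft indexing and the exact `step_range` type are assumptions (check `/schema`):

```python
# Zoom into draft 0 over its 20 auto-test steps to see where the update norm blows up
diag = env.step(LandscapeforgeAction(kind="inspect", draft_idx=0, step_range=(0, 19)))
```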
## Quick-start (drive with any OpenAI-compatible LLM)
The repo ships `run_llm_episode.py`, which drives one episode against any `/v1/chat/completions` endpoint (HuggingFace router, Ollama, vLLM, …):
```bash
# Ollama local
API_BASE_URL=http://localhost:11434/v1 MODEL_NAME=qwen2.5:3b \
python -m landscapeforge.run_llm_episode
# HuggingFace router
HF_TOKEN=hf_xxx MODEL_NAME=Qwen/Qwen2.5-7B-Instruct \
python -m landscapeforge.run_llm_episode
```
Full turn transcripts (prompt, raw reply, parsed action, env feedback, reward breakdown) are written to `episode_logs/*.jsonl` + `*.md`.
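A quick way to skim those transcripts programmatically; the per-turn key names below are assumptions based on the description above:

```python
import glob
import json

for path in sorted(glob.glob("episode_logs/*.jsonl")):
    with open(path) as f:
        for line in f:
            turn = json.loads(line)
            # Assumed keys: prompt, raw_reply, parsed_action, env_feedback, reward_breakdown
            print(turn.get("parsed_action"), turn.get("reward_breakdown"))
```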
---
## What to click first in `/web`
1. **Baseline Race** tab → pick Rosenbrock → hit "🏁 Race!" to see how default-SGD, default-Momentum, **tuned-Adam**, and crude-L-BFGS actually perform on the classic stiff valley.
2. **Optimizer Arena** tab → keep the sample SGD+Momentum optimizer, hit "⚔️ Run arena" to see the reward breakdown vs tuned Adam.
3. **Landscape Explorer** tab → browse the 9 template families with contour plots + structural hints.
---
## Repo structure
```
landscapeforge/
├── LANDSCAPEFORGE_DESIGN.md         # Full design doc (v0.2)
├── IMPLEMENTATION.md                # What's in the code today + constants
├── models.py                        # Action + Observation (pydantic)
├── landscapes.py                    # 9 analytic template builders with gradients
├── reference_optimizers.py          # SGD / Momentum / Adam / L-BFGS + LR tuner
├── sandbox.py                       # AST-strip + restricted exec + timeout
├── arena.py                         # Phase-D runner + auto_test_draft
├── rewards.py                       # Terminal reward + stepwise feedback
├── prompts.py                       # obs → prompt / response → action
├── run_llm_episode.py               # LLM-in-the-loop runner (OpenAI-compatible)
├── server/
│   ├── app.py                       # FastAPI + mounted Gradio at /web
│   └── landscapeforge_environment.py  # OpenEnv Environment class
├── demo/ui.py                       # Gradio UI source
├── tests/test_episode.py            # Scripted end-to-end tests
└── episode_logs/                    # Per-episode JSONL + Markdown transcripts
```
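Of those modules, `sandbox.py` is the most safety-critical, since drafts are arbitrary LLM-written code. A minimal sketch of the AST-strip + restricted-exec pattern it names; the banned-node list, namespace contents, and the omitted timeout are all assumptions, not the real implementation:

```python
import ast
import builtins
import numpy as np

BANNED_NODES = ("Import", "ImportFrom")  # assumption: the real deny-list is longer

SAFE_BUILTINS = {
    "range": range, "len": len, "min": min, "max": max, "abs": abs,
    "__build_class__": builtins.__build_class__,  # needed so `class Optimizer:` works
    "__name__": "draft",
}

def load_draft(code: str):
    """Screen a draft's AST, then exec it in a stripped-down namespace."""
    for node in ast.walk(ast.parse(code)):
        if type(node).__name__ in BANNED_NODES:
            raise ValueError(f"disallowed construct: {type(node).__name__}")
    ns = {"__builtins__": SAFE_BUILTINS, "np": np}
    exec(code, ns)  # the real sandbox.py also enforces a wall-clock timeout
    return ns["Optimizer"]
```

`load_draft` would then be handed the `code` string from a `draft` action and its return value passed to the 20-step auto-test.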
---
## Research anchors
LandscapeForge sits at the intersection of five established research threads:
- **Thread 1** – LLMs as optimizer designers: [Lion (NeurIPS 2023)](https://arxiv.org/abs/2302.06675), [FunSearch (Nature 2024)](https://www.nature.com/articles/s41586-023-06924-6)
- **Thread 2** – Adversarial / co-evolutionary LLM-env loops: Coevolve, [GenEnv (ICLR 2026)](https://arxiv.org/html/2512.19682v1)
- **Thread 3** – Iterative code refinement: [Self-Refine](https://arxiv.org/abs/2303.17651)
- **Thread 4** – GRPO with measurable rewards: [HPC GFLOPS reward paper](https://arxiv.org/abs/2602.12049v1)
- **Thread 5** – Analytical landscape benchmarks: [BBOB/COCO](https://inria.hal.science/hal-00362649/document), [POET](https://arxiv.org/abs/1901.01753)

Every ingredient has prior work; the combination (LLM-generated optimizers + LLM-picked landscapes + iterative REPL + GRPO on Adam-relative progress) is novel.
---
## License
Apache-2.0.