---
title: LandscapeForge
emoji: 🏞️
colorFrom: purple
colorTo: blue
sdk: docker
pinned: true
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - optimization
  - llm-agents
  - self-improvement
  - gradio
license: apache-2.0
short_description: LLM agent designs optimizers via a probe-draft-commit REPL.
---
# 🏞️ LandscapeForge

**An OpenEnv where an LLM agent designs optimization algorithms through a probe-draft-commit REPL, trained against a Goldilocks-regulated landscape adversary.**

Target: **OpenEnv Hackathon, April 2026**. Primary fit: Theme 4 (Self-Improvement); secondary: Theme 1 (Multi-Agent).

---
## What this Space gives you

Two things, in one container:

| Path | What it is |
|---|---|
| **`/web`** | **Interactive Gradio demo** – landscape explorer + baseline race + paste-your-own-optimizer arena. Visual-first, meant to make the env legible to judges. |
| **`/reset`, `/step`, `/schema`, WebSocket** | **OpenEnv FastAPI endpoints** – wire the env into a TRL / Unsloth GRPO training loop. |

Same process, no second container required.
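If you want to hit the FastAPI side directly, here is a minimal sketch over plain HTTP. It assumes the standard OpenEnv request shapes and uses an illustrative action payload; `/schema` is the source of truth for the real field names:

```python
import requests

BASE = "http://localhost:8000"  # app_port from the frontmatter above

print(requests.get(f"{BASE}/schema").json())  # action/observation schema
obs = requests.post(f"{BASE}/reset").json()   # start a fresh episode
step = requests.post(f"{BASE}/step", json={
    "action": {"kind": "run_baseline", "baseline_name": "adam"}  # illustrative
}).json()
```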
| --- | |
## How the env works (in 90 seconds)

**OptCoder** is the LLM policy. Each episode:

1. **LandscapeForge** (an internal template picker in v1) chooses a loss landscape `f: ℝⁿ → ℝ` at a tier-appropriate difficulty: quadratic / Rosenbrock / Styblinski-Tang / Gaussian-mix / Himmelblau / plateau / cliff.
2. **OptCoder runs a 4-action REPL** with a budget of 12 units:
   - `run_baseline(name)` – run SGD / Momentum / Adam / L-BFGS on the hidden landscape and see its trajectory (cost: 2)
   - `draft(code)` – submit a full `Optimizer` class; the env auto-tests it for 20 steps (cost: 2)
   - `inspect(draft_idx, step_range)` – zoom into a prior draft's per-step `(x, f, grad, update_norm, step_size_eff)` to diagnose failures (cost: 1)
   - `commit` – evaluate the latest draft on the full **Phase-D arena**: 10 fresh seeds × 200 steps (cost: 0)
3. **Reward** (terminal only; stepwise signals are feedback, not reward) – see the sketch after this list:
   - `r_regret` – **Adam-relative progress** (Adam's LR is tuned per landscape; no `f_min` dependency, so the signal generalises directly to NN training)
   - `r_convergence`, `r_robustness`, `r_novelty` (gated), minus `r_budget`, minus `r_eval_failures`
4. **GRPO** can then train the policy; arena wall-clock is ~50 ms, giving on the order of 36k episodes/hour on one H100.
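For concreteness, a sketch of the headline `r_regret` term, reconstructed from the description above and the `my_progress` / `adam_progress` keys in the quick-start below. The real formula lives in `rewards.py`; the normalisation and the clip bound here are assumptions:

```python
import numpy as np

def r_regret(f0: float, f_final_mine: float, f_final_adam: float,
             eps: float = 1e-12) -> float:
    """Adam-relative progress, sketched.

    'Progress' = fraction of the initial loss an optimizer removed.
    Scoring against tuned Adam instead of f_min means the same
    signal transfers to settings where the minimum is unknown
    (e.g. NN training).
    """
    my_progress = (f0 - f_final_mine) / (abs(f0) + eps)
    adam_progress = (f0 - f_final_adam) / (abs(f0) + eps)
    # Ratio > 1 means the draft out-progressed tuned Adam; clipping
    # keeps one lucky seed from dominating (bound assumed here).
    return float(np.clip(my_progress / max(adam_progress, eps), 0.0, 2.0))
```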
See `IMPLEMENTATION.md` and `LANDSCAPEFORGE_DESIGN.md` in this repo for the full spec, the staged bootstrap (SFT → solo RL → adversarial unfreezing), and the anti-reward-hacking table.

---

## Quick-start (Python client)
```python
from landscapeforge import LandscapeforgeEnv, LandscapeforgeAction

with LandscapeforgeEnv.from_docker_image("landscapeforge-env:latest") as env:
    obs = env.reset()

    # Probe: watch tuned Adam on the hidden landscape (cost: 2)
    env.step(LandscapeforgeAction(kind="run_baseline", baseline_name="adam"))

    # Draft: SGD with momentum; the sandbox provides `np`, so no import (cost: 2)
    env.step(LandscapeforgeAction(kind="draft", code="""
class Optimizer:
    def __init__(self, dim):
        self.lr = 0.05; self.beta = 0.9
        self.v = np.zeros(dim)

    def step(self, x, f_val, grad):
        self.v = self.beta * self.v - self.lr * grad
        return x + self.v
"""))

    # Commit: evaluate the latest draft on the Phase-D arena (cost: 0)
    result = env.step(LandscapeforgeAction(kind="commit"))
    print(result.observation.r_optcoder_breakdown)
    # {'r_regret': ..., 'r_convergence': ..., 'r_robustness': ...,
    #  'r_novelty': ..., 'r_budget': ..., 'r_eval_failures': ...,
    #  'my_progress': ..., 'adam_progress': ..., 'speedup_vs_adam': ...}
```
## Quick-start (drive with any OpenAI-compat LLM)

The repo ships `run_llm_episode.py`, which drives one episode against any `/v1/chat/completions` endpoint (HuggingFace router, Ollama, vLLM, …):

```bash
# Ollama, local
API_BASE_URL=http://localhost:11434/v1 MODEL_NAME=qwen2.5:3b \
  python -m landscapeforge.run_llm_episode

# HuggingFace router
HF_TOKEN=hf_xxx MODEL_NAME=Qwen/Qwen2.5-7B-Instruct \
  python -m landscapeforge.run_llm_episode
```

Full turn transcripts (prompt, raw reply, parsed action, env feedback, reward breakdown) are written to `episode_logs/*.jsonl` + `*.md`.
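To skim a transcript programmatically, something like the following works. This is a sketch: the per-turn key names `parsed_action` and `reward_breakdown` are assumptions, so check one of the `.md` twins for the actual field layout:

```python
import glob
import json

# Load the most recent episode transcript (one JSON object per turn).
latest = sorted(glob.glob("episode_logs/*.jsonl"))[-1]
with open(latest) as f:
    turns = [json.loads(line) for line in f]

# Key names below are illustrative, not guaranteed by the env.
for turn in turns:
    print(turn.get("parsed_action"), turn.get("reward_breakdown"))
```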
| --- | |
| ## What to click first in `/web` | |
| 1. **Baseline Race** tab β pick Rosenbrock β hit "π Race!" to see how default-SGD, default-Momentum, **tuned-Adam**, and crude-L-BFGS actually perform on the classic stiff valley. | |
| 2. **Optimizer Arena** tab β keep the sample SGD+Momentum optimizer, hit "βοΈ Run arena" to see the reward breakdown vs tuned Adam. | |
| 3. **Landscape Explorer** tab β browse the 9 template families with contour plots + structural hints. | |
| --- | |
## Repo structure

```
landscapeforge/
├── LANDSCAPEFORGE_DESIGN.md   # Full design doc (v0.2)
├── IMPLEMENTATION.md          # What's in the code today + constants
├── models.py                  # Action + Observation (pydantic)
├── landscapes.py              # 9 analytic template builders with gradients
├── reference_optimizers.py    # SGD / Momentum / Adam / L-BFGS + LR tuner
├── sandbox.py                 # AST-strip + restricted exec + timeout
├── arena.py                   # Phase-D runner + auto_test_draft
├── rewards.py                 # Terminal reward + stepwise feedback
├── prompts.py                 # obs → prompt / response → action
├── run_llm_episode.py         # LLM-in-the-loop runner (OpenAI-compat)
├── server/
│   ├── app.py                 # FastAPI + mounted Gradio at /web
│   └── landscapeforge_environment.py  # OpenEnv Environment class
├── demo/ui.py                 # Gradio UI source
├── tests/test_episode.py      # Scripted end-to-end tests
└── episode_logs/              # Per-episode JSONL + Markdown transcripts
```
| --- | |
| ## Research anchors | |
| LandscapeForge sits at the intersection of five established research threads: | |
| - **Thread 1** β LLMs as optimizer designers: [Lion (NeurIPS 2023)](https://arxiv.org/abs/2302.06675), [FunSearch (Nature 2024)](https://www.nature.com/articles/s41586-023-06924-6) | |
| - **Thread 2** β Adversarial / co-evolutionary LLM-env: Coevolve, [GenEnv (ICLR 2026)](https://arxiv.org/html/2512.19682v1) | |
| - **Thread 3** β Iterative code refinement: [Self-Refine](https://arxiv.org/abs/2303.17651) | |
| - **Thread 4** β GRPO with measurable rewards: [HPC GFLOPS reward paper](https://arxiv.org/abs/2602.12049v1) | |
| - **Thread 5** β Analytical landscape benchmarks: [BBOB/COCO](https://inria.hal.science/hal-00362649/document), [POET](https://arxiv.org/abs/1901.01753) | |
| Every ingredient has prior work; the combination β LLM-generated optimizers + LLM-picked landscapes + iterative REPL + GRPO on Adam-relative progress β is novel. | |
| --- | |
| ## License | |
| Apache-2.0. | |