---
title: LandscapeForge
emoji: πŸ”οΈ
colorFrom: purple
colorTo: blue
sdk: docker
pinned: true
app_port: 8000
tags:
  - openenv
  - reinforcement-learning
  - optimization
  - llm-agents
  - self-improvement
  - gradio
license: apache-2.0
short_description: LLM agent designs optimizers via a probe-draft-commit REPL.
---

# πŸ”οΈ LandscapeForge

**An OpenEnv where an LLM agent designs optimization algorithms through a probe-draft-commit REPL, trained against a Goldilocks-regulated landscape adversary.**

Built for the **OpenEnv Hackathon, April 2026**: primary Theme 4 (Self-Improvement), secondary Theme 1 (Multi-Agent).

---

## What this Space gives you

Two things in one container:

| Path | What it is |
|---|---|
| **`/web`** | **Interactive Gradio demo** β€” landscape explorer + baseline race + paste-your-own-optimizer arena. Visual-first, meant to make the env legible to judges. |
| **`/reset`, `/step`, `/schema`, WebSocket** | **OpenEnv FastAPI endpoints** β€” wire the env into a TRL / Unsloth GRPO training loop. |

Same process, no second container required.
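
If you want to poke those endpoints directly over HTTP, a smoke test might look like the sketch below. The Space URL is a placeholder, the request methods and the JSON payload shape are assumptions; the `/schema` endpoint is the authoritative reference for the action format.

```python
# Hedged smoke test against the OpenEnv endpoints. BASE is a
# placeholder, and the request/response JSON shapes are assumptions;
# consult the /schema endpoint for the real action format.
import requests

BASE = "https://your-space.hf.space"  # hypothetical Space URL

print(requests.get(f"{BASE}/schema").json())   # inspect the action schema
obs = requests.post(f"{BASE}/reset").json()    # start a fresh episode
step = requests.post(
    f"{BASE}/step",
    json={"kind": "run_baseline", "baseline_name": "adam"},  # assumed shape
).json()
print(step)
```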

---

## How the env works (in 90 seconds)

**OptCoder** is the LLM policy. Each episode:

1. **LandscapeForge** (an internal template picker in v1) chooses a loss landscape `f: ℝⁿ → ℝ` at a tier-appropriate difficulty, drawn from nine template families including quadratic / Rosenbrock / Styblinski-Tang / Gaussian-mix / Himmelblau / plateau / cliff (a Rosenbrock sketch appears at the end of this section).
2. **OptCoder runs a 4-action REPL** with a budget of 12 units:
   - `run_baseline(name)` β€” run SGD / Momentum / Adam / L-BFGS on the hidden landscape and see their trajectory (cost: 2)
   - `draft(code)` β€” submit a full `Optimizer` class; the env auto-tests it for 20 steps (cost: 2)
   - `inspect(draft_idx, step_range)` β€” zoom into a prior draft's per-step `(x, f, grad, update_norm, step_size_eff)` to diagnose failures (cost: 1)
   - `commit` β€” evaluate the latest draft on the full **Phase-D arena**: 10 fresh seeds Γ— 200 steps (cost: 0)
3. **Reward** (terminal only; intermediate steps return feedback, not reward):
   - `r_regret` — **Adam-relative progress** against a per-landscape tuned Adam LR; no `f_min` dependency, so it generalises directly to NN training (sketched just after this list)
   - `r_convergence`, `r_robustness`, `r_novelty` (gated), minus `r_budget`, minus `r_eval_failures`
4. **GRPO** can then train the policy; a full arena evaluation takes ~50 ms of wall-clock, so roughly 36k episodes/hour on one H100.
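
To make the `r_regret` term concrete, here is a minimal sketch of an Adam-relative regret computed from per-seed arena losses. The function, names, and clipping are illustrative assumptions, not the actual `rewards.py` code; only the idea (measure progress against tuned Adam rather than against a hidden `f_min`) comes from the design above.

```python
# Hedged sketch of an Adam-relative progress reward. Assumes the arena
# reports, per seed: the starting loss f0 and the final losses reached
# by the committed draft (f_draft) and by tuned Adam (f_adam).
import numpy as np

def adam_relative_regret(f0, f_draft, f_adam, eps=1e-12):
    my_progress = f0 - f_draft     # loss removed by the committed draft
    adam_progress = f0 - f_adam    # loss removed by tuned Adam
    speedup = my_progress / np.maximum(adam_progress, eps)
    # 1.0 means "matched Adam"; clip so a diverged draft cannot
    # dominate the reward scale.
    return float(np.clip(np.mean(speedup) - 1.0, -1.0, 1.0))
```

Because the reference is Adam's own progress on the same seeds, nothing here depends on knowing the landscape's minimum, which is what lets the signal transfer to real NN training losses.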

See `IMPLEMENTATION.md` and `LANDSCAPEFORGE_DESIGN.md` in this repo for the full spec, staged bootstrap (SFT β†’ solo RL β†’ adversarial unfreezing), and the anti-reward-hacking table.
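
For a feel of what a template family looks like, here is the Rosenbrock valley from step 1, written as the kind of analytic `(f, grad)` pair that `landscapes.py` builds. The signatures are illustrative; the real builder interface may differ.

```python
# The 2-D Rosenbrock valley with its analytic gradient: a long, curved,
# stiff valley that punishes plain SGD and rewards momentum/curvature.
import numpy as np

def rosenbrock(x, a=1.0, b=100.0):
    return (a - x[0]) ** 2 + b * (x[1] - x[0] ** 2) ** 2

def rosenbrock_grad(x, a=1.0, b=100.0):
    dx0 = -2.0 * (a - x[0]) - 4.0 * b * x[0] * (x[1] - x[0] ** 2)
    dx1 = 2.0 * b * (x[1] - x[0] ** 2)
    return np.array([dx0, dx1])
```

The minimum sits at `(a, a**2)` with `f = 0`, but the Adam-relative reward above never needs that fact.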

---

## Quick-start (Python client)

```python
from landscapeforge import LandscapeforgeEnv, LandscapeforgeAction

with LandscapeforgeEnv.from_docker_image("landscapeforge-env:latest") as env:
    obs = env.reset()
    # Probe: watch tuned Adam on the hidden landscape (cost: 2).
    env.step(LandscapeforgeAction(kind="run_baseline", baseline_name="adam"))
    # Draft a classical-momentum optimizer; the env auto-tests it
    # for 20 steps (cost: 2).
    env.step(LandscapeforgeAction(kind="draft", code="""
class Optimizer:
    def __init__(self, dim):
        self.lr = 0.05; self.beta = 0.9
        self.v = np.zeros(dim)  # velocity buffer
    def step(self, x, f_val, grad):
        self.v = self.beta * self.v - self.lr * grad
        return x + self.v
"""))
    # Commit: evaluate the latest draft on the full Phase-D arena (cost: 0).
    result = env.step(LandscapeforgeAction(kind="commit"))
    print(result.observation.r_optcoder_breakdown)
    # {'r_regret': ..., 'r_convergence': ..., 'r_robustness': ...,
    #  'r_novelty': ..., 'r_budget': ..., 'r_eval_failures': ...,
    #  'my_progress': ..., 'adam_progress': ..., 'speedup_vs_adam': ...}
```

## Quick-start (drive with any OpenAI-compat LLM)

The repo ships `run_llm_episode.py`, which drives a single episode against any `/v1/chat/completions` endpoint (HuggingFace router, Ollama, vLLM, …):

```bash
# Ollama local
API_BASE_URL=http://localhost:11434/v1 MODEL_NAME=qwen2.5:3b \
  python -m landscapeforge.run_llm_episode

# HuggingFace router
HF_TOKEN=hf_xxx MODEL_NAME=Qwen/Qwen2.5-7B-Instruct \
  python -m landscapeforge.run_llm_episode
```

Full turn transcripts (prompt, raw reply, parsed action, env feedback, reward breakdown) are written to `episode_logs/*.jsonl` + `*.md`.
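
A quick way to skim those transcripts (the record keys here are guesses inferred from the field list above, not a documented schema):

```python
# Hedged sketch: skim per-turn records from the episode transcripts.
# Keys like "parsed_action" and "reward_breakdown" are assumptions
# based on the fields named above; check a real file for the schema.
import json
from pathlib import Path

for path in sorted(Path("episode_logs").glob("*.jsonl")):
    print(f"== {path.name}")
    for line in path.read_text().splitlines():
        turn = json.loads(line)
        print(turn.get("parsed_action"), turn.get("reward_breakdown"))
```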

---

## What to click first in `/web`

1. **Baseline Race** tab β†’ pick Rosenbrock β†’ hit "🏁 Race!" to see how default-SGD, default-Momentum, **tuned-Adam**, and crude-L-BFGS actually perform on the classic stiff valley.
2. **Optimizer Arena** tab β†’ keep the sample SGD+Momentum optimizer, hit "βš”οΈ Run arena" to see the reward breakdown vs tuned Adam.
3. **Landscape Explorer** tab β†’ browse the 9 template families with contour plots + structural hints.

---

## Repo structure

```
landscapeforge/
β”œβ”€β”€ LANDSCAPEFORGE_DESIGN.md    # Full design doc (v0.2)
β”œβ”€β”€ IMPLEMENTATION.md           # What's in the code today + constants
β”œβ”€β”€ models.py                   # Action + Observation (pydantic)
β”œβ”€β”€ landscapes.py               # 9 analytic template builders with gradients
β”œβ”€β”€ reference_optimizers.py     # SGD / Momentum / Adam / L-BFGS + LR tuner
β”œβ”€β”€ sandbox.py                  # AST-strip + restricted exec + timeout
β”œβ”€β”€ arena.py                    # Phase-D runner + auto_test_draft
β”œβ”€β”€ rewards.py                  # Terminal reward + stepwise feedback
β”œβ”€β”€ prompts.py                  # obs β†’ prompt / response β†’ action
β”œβ”€β”€ run_llm_episode.py          # LLM-in-the-loop runner (OpenAI-compat)
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ app.py                  # FastAPI + mounted Gradio at /web
β”‚   └── landscapeforge_environment.py  # OpenEnv Environment class
β”œβ”€β”€ demo/ui.py                  # Gradio UI source
β”œβ”€β”€ tests/test_episode.py       # Scripted end-to-end tests
└── episode_logs/               # Per-episode JSONL + Markdown transcripts
```

---

## Research anchors

LandscapeForge sits at the intersection of five established research threads:

- **Thread 1** β€” LLMs as optimizer designers: [Lion (NeurIPS 2023)](https://arxiv.org/abs/2302.06675), [FunSearch (Nature 2024)](https://www.nature.com/articles/s41586-023-06924-6)
- **Thread 2** β€” Adversarial / co-evolutionary LLM-env: Coevolve, [GenEnv (ICLR 2026)](https://arxiv.org/html/2512.19682v1)
- **Thread 3** β€” Iterative code refinement: [Self-Refine](https://arxiv.org/abs/2303.17651)
- **Thread 4** β€” GRPO with measurable rewards: [HPC GFLOPS reward paper](https://arxiv.org/abs/2602.12049v1)
- **Thread 5** β€” Analytical landscape benchmarks: [BBOB/COCO](https://inria.hal.science/hal-00362649/document), [POET](https://arxiv.org/abs/1901.01753)

Every ingredient has prior work; the combination β€” LLM-generated optimizers + LLM-picked landscapes + iterative REPL + GRPO on Adam-relative progress β€” is novel.

---

## License

Apache-2.0.