# LandscapeForge — Implementation Notes (v0.1 code)

Status: **env core working end-to-end in-process**; scripted tests pass. Ships what §18 v1 of `LANDSCAPEFORGE_DESIGN.md` specifies as the weekend MVP, minus the training and demo layers.

## What's implemented

| File | Purpose |
|---|---|
| `models.py` | Unified `LandscapeforgeAction` (discriminated by `kind`) + `LandscapeforgeObservation` |
| `landscapes.py` | 9 analytic template builders with hand-written gradients + `TIER_MENU` + `structural_hints()` |
| `reference_optimizers.py` | SGD / Momentum / Adam / crude L-BFGS + `run_baseline()` |
| `sandbox.py` | AST strip (keep only `class Optimizer`), safe globals, SIGALRM timeout; `compile_optimizer()` |
| `arena.py` | `run_arena()` for Phase-D eval + `auto_test_draft()` for draft-time feedback |
| `rewards.py` | Terminal reward (`compute_optcoder_reward`) + stepwise feedback (`compute_step_reward`) + `ast_novelty_score` |
| `server/landscapeforge_environment.py` | OpenEnv `Environment` subclass wiring everything together |
| `server/app.py` | FastAPI wrapper (scaffold, unchanged) |
| `client.py` | HTTP client over the unified action schema |
| `tests/test_episode.py` | 3 scripted episodes, all passing |

## Action space (§7.1)

Four actions with differentiated budget cost:

| Action | Cost | What it returns |
|---|---|---|
| `run_baseline(name)` | 2 | Fixed 30-step trajectory `(x_t, f_t, \|g_t\|)`; step count is env-controlled for comparability; source NOT revealed |
| `draft(code)` | 2 | Auto-test summary on 1 seed × 20 steps + `compile_error` if any |
| `inspect(draft_idx, step_range)` | 1 | Per-step detail `(x, f, grad, update_norm, step_size_eff)` from the referenced draft |
| `commit` | 0 | Terminates, triggers Phase-D arena eval |

Budget total: **12 units**. A hard ceiling of 6 drafts per episode prevents brute-force enumeration.
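The cost-and-ceiling accounting above can be sketched as a small helper. This is a hypothetical `BudgetTracker` for illustration — the real bookkeeping lives in `LandscapeforgeEnvironment`; only the constants match the tables:

```python
# Constants as documented; the tracker class itself is illustrative.
ACTION_COSTS = {"run_baseline": 2, "draft": 2, "inspect": 1, "commit": 0}
BUDGET_TOTAL = 12
MAX_DRAFTS = 6

class BudgetTracker:
    def __init__(self):
        self.spent = 0
        self.drafts = 0

    def charge(self, kind: str) -> bool:
        """Charge one action; return False if it would exceed a limit
        (budget exhaustion is what triggers the auto-commit contract)."""
        cost = ACTION_COSTS[kind]
        if self.spent + cost > BUDGET_TOTAL:
            return False  # budget exhausted -> env finalizes the episode
        if kind == "draft":
            if self.drafts >= MAX_DRAFTS:
                return False  # hard draft ceiling
            self.drafts += 1
        self.spent += cost
        return True

tracker = BudgetTracker()
for kind in ["run_baseline", "draft", "inspect", "draft", "commit"]:
    tracker.charge(kind)
print(tracker.spent)  # 2 + 2 + 1 + 2 + 0 = 7
```

Note that 6 drafts alone (6 × 2 = 12) exactly exhaust the budget, so the draft ceiling and the budget ceiling coincide for a draft-only episode.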
**Auto-commit contract:** if the agent never calls `commit`, budget exhaustion auto-triggers the same Phase-D arena evaluation on `state.current_draft` (i.e. the most recent draft the agent submitted). Whether the agent calls `commit` explicitly or hits budget exhaustion, **the most recent draft is always what gets evaluated**. Implemented in `LandscapeforgeEnvironment._finalize_episode` — `current_draft` is evaluated; only "no draft at all" produces worst-case regret. The prompt (`prompts.SYSTEM`) documents this contract to the LLM so it understands it isn't penalized for not committing, but should make sure its *latest* draft is its best one.

## Reward

Two distinct signals — only the terminal one is used as the training scalar. Stepwise signals are in-context feedback for the LLM.

### Terminal reward (the GRPO training scalar)

Computed once, at commit (or auto-commit on budget exhaustion), after the full Phase-D arena evaluation. Lives in `obs.reward` and `obs.r_optcoder`.

```
r_total =   1.0  · r_regret
          + 0.3  · r_convergence
          + 0.3  · r_robustness
          + 0.1  · r_novelty        (gated: 0 unless r_regret > 0.5)
          − 0.05 · r_budget
          − 0.5  · r_eval_failures
```

Range: roughly **[−1.55, +1.65]** in practice. Weights live in `rewards.py` (`W_*`).

#### 1. `r_regret` — the main signal (Adam-relative descent, NO `f_min` dependency)

**Measures:** how much further the committed optimizer descended on `f(x)` than the LR-tuned Adam baseline on the same landscape, starting from the same init. Purely relative — does not require knowing the absolute minimum.

**Computed:**

```
# Before running the Adam baseline, tune its LR per-landscape via a short
# sweep over {1e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1} on a dedicated seed
# (30 steps each). This keeps the comparison fair — the agent must beat
# Adam-at-best-LR, not Adam-at-PyTorch-default.
best_lr       = tune_adam_lr(f, grad, x0=sweep_seed_init, sweep_steps=30)
my_progress   = mean over 10 arena seeds of (f_initial − f_final)
adam_progress = same, running Adam(lr=best_lr) on the same seeds
denom         = max(adam_progress, 0.01 · mean|f_initial| + 1e-6)
r_regret      = clamp( my_progress / denom − 1 , −1, +1 )
```

where `f_initial` and `f_final` are observable per-seed values from `arena.initial_values` and `arena.final_values`. Crashed seeds contribute 0 progress (conservative — "you didn't descend on that seed"). The denominator floor (~1% of initial f magnitude) protects against near-zero Adam progress exploding the ratio (e.g. on plateau landscapes where Adam barely moves).

**Range:** [−1, +1]

- `+1.0`: descended ≥ 2× as far as Adam (clipped ceiling)
- `0.0`: matched Adam's descent exactly
- `−1.0`: made zero or negative progress while Adam descended normally (clipped floor)

**Why this shape:** Adam-relative normalization is scale-invariant — it works the same on T0 quadratics (|f| ~ 10) and Rosenbrock (|f| ~ 1000) without hand-tuned per-landscape knobs. And crucially, **it does NOT require knowing `f_min`** — this design extends directly to neural-network training as Phase D (v3), where the global minimum of training loss is unknowable.

**Bonus fields in the reward breakdown:** `my_progress`, `adam_progress`, and `speedup_vs_adam` (= `my_progress / denom`) are logged alongside `r_regret` for diagnostics and human-readable leaderboards (e.g. "this optimizer is 10× faster than Adam on this landscape").

#### 2. `r_convergence` — speed bonus

**Measures:** how quickly the committed optimizer drops `f` below 1% of the initial value on the first arena seed.

**Computed:**

```
r_conv = clamp( 1 − convergence_step / N , 0, 1 )   if converged
       = 0.0                                        if never reached 1%
```

where `N = ARENA_STEPS = 200` and `convergence_step` is the first `t` such that `f(x_t) < 0.01 · f(x_0)` on seed 101 (the first seed in `ARENA_SEEDS`).
**Range:** [0, 1]

- `1.0`: converged at step 0 (impossible; asymptotic)
- `0.5`: converged at step 100
- `0.0`: never converged within 200 steps

**Why:** distinguishes fast optimizers from slow ones among those that do converge. Without this, an optimizer that reaches the minimum in 50 steps and one that reaches it in 199 get identical `r_regret`.

**Only uses seed 101** — one seed's trajectory is enough to proxy speed; averaging across all 10 would be more faithful, but this is cheaper.

#### 3. `r_robustness` — cross-seed consistency

**Measures:** whether the optimizer achieves similar final values across all 10 arena seeds, or whether it's luck-of-the-init sensitive.

**Computed in `ArenaResult.robustness`:**

```
r_robust = clamp( 1 − std(final_values) / |mean(final_values)| , 0, 1 )
```

using only seeds that didn't crash. If mean ≈ 0, returns 1.0 when std is tiny, else 0.0.

**Range:** [0, 1]

- `1.0`: all 10 seeds ended at essentially the same `f` value (tight distribution)
- `0.0`: huge variance across seeds (works on some inits, fails on others)

**Why:** anti-"works-only-from-favorable-init" defense (§11). An optimizer that converges cleanly on seed 101 but diverges on seed 505 has low `r_robustness` even if `mean_regret` is okay.

#### 4. `r_novelty` — structural departure from references

**Measures:** how structurally different the committed source is from the standard optimizers SGD/Adam/Momentum.

**Computed via `ast_novelty_score`:**

```
r_novelty = min over ref in {sgd, adam, momentum} of
            1 − difflib.SequenceMatcher(committed, ref).ratio()
clamped to [0, 1]
```

Uses character-level diff ratio (difflib). `0.0` = byte-identical to one of the references; `1.0` = totally different strings. For reference, a tweaked Adam with different hparams scores ~0.3; a genuinely different algorithm (line search + trust region) scores ~0.7.

**Range:** [0, 1]

**Gate:** `r_novelty` is **only applied when `r_regret > 0.5`**.
This prevents rewarding "novel AND broken" — you must beat Adam by a clear margin before creativity earns anything.

**Why:** prevents the model from just copying Adam after running `run_baseline("adam")`. Without this gate or term, the reward-maximizing strategy is "memorize Adam."

#### 5. `r_budget` — budget-spend penalty

**Measures:** what fraction of the action budget was used.

**Computed:**

```
r_budget = clamp( budget_spent / 12 , 0, 1 )
```

where `budget_spent` is the sum of per-action costs (baseline=2, draft=2, inspect=1, commit=0), NOT the count of actions.

**Range:** [0, 1]

- `0.0`: no budget used (impossible, since at least one draft is needed)
- `1.0`: full budget consumed

**Why:** mild pressure toward efficiency. With `W_BUDGET = 0.05`, the swing between "committed at budget 4" and "exhausted at 12" is only (8/12) × 0.05 ≈ 0.033 reward — deliberately small so it doesn't override algorithmic quality, but nonzero, enough to discourage deliberate stalling.

#### 6. `r_eval_failures` — crash penalty

**Measures:** fraction of arena seeds where the committed optimizer raised a `SandboxError` (NaN output, wrong shape, timeout, Python error).

**Computed:**

```
r_eval_failures = sum(arena.crashed) / 10
```

**Range:** [0, 1]

- `0.0`: all 10 seeds ran to completion
- `1.0`: committed code crashes on every seed

**Why:** heavily weighted at `W_EVAL_FAIL = 0.5`, so a uniformly-crashing commit loses a full 0.5 from this term alone, regardless of the other components. Prevents "commit broken code to avoid bad eval" gaming (§11).
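Putting the six components together, the weighted sum with the novelty gate can be sketched as follows. The function name is hypothetical (the real version is `rewards.compute_optcoder_reward`); the weights are the documented `W_*` values:

```python
def combine_terminal_reward(r_regret, r_convergence, r_robustness,
                            r_novelty, r_budget, r_eval_failures):
    """Illustrative sketch of the r_total formula; weights match rewards.py."""
    # Novelty only counts once the draft clearly beats the Adam baseline.
    gated_novelty = r_novelty if r_regret > 0.5 else 0.0
    return (1.0 * r_regret
            + 0.3 * r_convergence
            + 0.3 * r_robustness
            + 0.1 * gated_novelty
            - 0.05 * r_budget
            - 0.5 * r_eval_failures)

# Component values from the scripted-test example (quadratic, dim=5):
# r_regret ~0, r_conv 0.835, r_robust 0.5897, novelty gated, 7/12 budget.
r = combine_terminal_reward(0.0, 0.835, 0.5897, 0.0, 7 / 12, 0.0)
print(round(r, 3))  # 0.398
```

The gate shows up as a hard branch on `r_regret`, which is why a tweaked-but-working Adam clone and a broken-but-exotic draft both earn zero novelty credit.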
### Concrete example

From the scripted test (quadratic dim=5, cond=44.4, SGD+momentum with lr=0.05 committed, progress-based reward, LR-tuned Adam baseline):

| Component | Value | Weighted contribution |
|---|---|---|
| `r_regret` | ~0 (my_progress ≈ adam_progress, both ≈ 10.51) | 0.000 |
| `r_convergence` | +0.835 | +0.251 |
| `r_robustness` | +0.5897 | +0.177 |
| `r_novelty` | 0 (gated; r_regret < 0.5) | 0.000 |
| `r_budget` | 0.583 (7/12 used) | −0.029 |
| `r_eval_failures` | 0.0 (no crashes) | 0.000 |
| **`r_total`** | — | **+0.398** |

Adam's tuned LR for this landscape came out to `0.03` — once the LR is tuned, the committed SGD+momentum (lr=0.05) is essentially **tied** with Adam, and the reward correctly reflects that. Under the previous unfair baseline (Adam at default lr=1e-3), the same draft would have scored 1.42. The ~1.0 reward swing reflects how much of the old "win" was just LR tuning, not algorithmic merit.

### Stepwise feedback (NOT training reward)

Computed in `compute_step_reward`. Surfaced to the LLM via `obs.last_action_result["feedback"]` after each non-terminal turn. Explicitly NOT summed into `r_total`.

- `phi_delta`: change in `−best_auto_test_final_f / 10` across this turn. Positive means the newest draft improved the best auto-test result. The LLM sees this and knows "I just made progress."
- `compile_penalty`: literal `-0.1` marker emitted whenever the latest draft failed to compile. Purely a flag for the LLM's context.

These are communication channels, not reward. Keeping them out of the training scalar preserves the terminal-only robustness property while still giving the LLM something to react to mid-episode.
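A minimal sketch of the two feedback fields, assuming the `PHI_SCALE = 10.0` and `COMPILE_PENALTY_SIGNAL = -0.1` constants from `rewards.py`. The function and dict shape here are illustrative, not the actual `compute_step_reward` signature:

```python
PHI_SCALE = 10.0
COMPILE_PENALTY_SIGNAL = -0.1

def phi(best_auto_test_final_f: float) -> float:
    # Lower final f on the best auto-tested draft -> higher potential.
    return -best_auto_test_final_f / PHI_SCALE

def step_feedback(best_f_before, best_f_after, compiled_ok):
    """Sketch of the in-context feedback; explicitly NOT part of r_total."""
    feedback = {"phi_delta": phi(best_f_after) - phi(best_f_before)}
    if not compiled_ok:
        feedback["compile_penalty"] = COMPILE_PENALTY_SIGNAL
    return feedback

# Best auto-test final f improved from 40.0 to 15.0 -> positive phi_delta.
print(step_feedback(40.0, 15.0, True))  # {'phi_delta': 2.5}
```

Because `phi_delta` is a potential difference, a draft that fails to improve on the episode's best-so-far yields a non-positive delta even if it is better than the immediately preceding draft.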
## Constants

| Constant | Value | Source |
|---|---|---|
| `BUDGET_TOTAL` | 12 | `server/landscapeforge_environment.py` |
| `ACTION_COSTS` | baseline=2, draft=2, inspect=1, commit=0 | `models.py` |
| `ARENA_SEEDS` | `[101, 202, ..., 1010]` (10 fresh seeds) | `server/landscapeforge_environment.py` |
| `ARENA_STEPS` | 200 | same |
| `BASELINE_STEPS` | 30 | same (fixed; agent cannot override) |
| Adam LR sweep grid | `{1e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1}` | `reference_optimizers.tune_adam_lr` |
| Adam LR sweep steps | 30 | same |
| Adam LR sweep init seed | 0 (not in `ARENA_SEEDS`) | `_ensure_adam_arena` in env |
| Draft auto-test init seed | 0 | `arena.auto_test_draft` |
| Draft auto-test steps | 20 | same |
| Init scale (seed-sampled `x0`) | `N(0, 0.5² I)` | `arena.run_arena` + `auto_test_draft` |
| Dim range per episode | 2–5 (v1) | `server/landscapeforge_environment.py` |
| Sandbox init timeout | 1.0 s | `sandbox.compile_optimizer` |
| Sandbox step timeout | 0.5 s | same |
| Reward weights | w_regret=1.0, w_conv=0.3, w_robust=0.3, w_novelty=0.1, w_budget=0.05, w_evalfail=0.5 | `rewards.py` |
| Novelty gate | applied only when `r_regret > 0.5` | `rewards.NOVELTY_GATE` |
| `PHI_SCALE` (potential normalizer) | 10.0 | `rewards.py` |
| `COMPILE_PENALTY_SIGNAL` | -0.1 | `rewards.py` |
| Tier menus | T0: quadratic/styblinski_tang/huber · T1: +gaussian_mix/himmelblau · T2: +rosenbrock/stiff_quadratic/plateau/cliff | `landscapes.TIER_MENU` |
| Quadratic cond-number cap per tier | T0: 100, T1: 1000, T2: 10000 | `_sample_params` in env |

## Assumptions / simplifications (v1)

1. **LandscapeForge is a template picker**, not a free-form code author. The env internally samples (template, params) uniformly from the active tier's menu — no LandscapeForge LLM adapter in v1. Defers §18 v2 non-differentiability and gradient-source risks.

2. **All gradients are analytic**, hand-written per template. No autodiff/JAX, no finite differences.
   Templates are verified differentiable by construction.

2b. **Reward does NOT depend on `f_min`.** v0.2 switched from `r_regret = 1 − (my_regret / adam_regret)` (which required knowing the global minimum) to a progress-based form: `r_regret = clamp(my_progress / adam_progress − 1, −1, +1)`, where `progress = f_initial − f_final` is observable per seed. `Landscape.f_min` is retained only for diagnostics, NOT used in training. This unlocks the v3 NN extension (training loss has no knowable minimum).

3. **Only OptCoder has a policy.** The OpenEnv `Environment` here exposes the OptCoder side; LandscapeForge selection is internal.

4. **Single backbone assumption** (Qwen2.5-3B base + OptCoder LoRA) is in the design but not in code; the training script is not yet implemented.

5. **Sandbox is in-process + SIGALRM timeout.** Works on the main thread / CPython / POSIX. Known bug: HTTP `/step` via uvicorn returns 500 because SIGALRM only fires on the main thread; a thread-based timeout fix is TODO.

6. **AST strip drops all module-level code except `class Optimizer`.** Imports are also dropped — the sandbox pre-injects `np` and `numpy` into globals, so submitted code can use `np.*` without an import line.

7. **Dim range 2–5** for v1 even though the design allows up to 100. Keeps arena eval fast (~30 ms/episode) and keeps the prompt token budget tight.

8. **Adam baseline for reward normalization is run inside the env** on every commit to compute `baseline_adam_regret`. Cost: one 200-step × 10-seed arena run per episode on top of the OptCoder eval. ~30 ms, acceptable.

9. **AST novelty score uses difflib** (character-level, Levenshtein-ish) rather than a true AST diff. Enough to detect "commit ≈ reference" but not semantically rigorous. Upgrade path noted.

10. **Tier advancement is not auto-wired.** `env.advance_tier(new_tier)` exists as a manual API; rolling-regret-based auto-advance is a trainer-side concern and not yet implemented.
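The AST strip from assumption 6 can be sketched with the stdlib `ast` module. This is a simplified stand-in for `sandbox.compile_optimizer` (no SIGALRM timeout, no safe-globals hardening); the helper name is hypothetical:

```python
import ast
import numpy as np

def strip_to_optimizer(source: str):
    """Keep only `class Optimizer` from the submitted module, drop all
    other top-level statements (imports included), then exec with np
    pre-injected — a sketch of sandbox.py's documented behavior."""
    tree = ast.parse(source)
    tree.body = [node for node in tree.body
                 if isinstance(node, ast.ClassDef) and node.name == "Optimizer"]
    if not tree.body:
        raise ValueError("no `class Optimizer` found in draft")
    # np/numpy are pre-injected, so drafts need no import line.
    safe_globals = {"np": np, "numpy": np}
    exec(compile(tree, "<draft>", "exec"), safe_globals)
    return safe_globals["Optimizer"]

src = """
import os                 # stripped by the AST pass
print("side effect")      # stripped too

class Optimizer:
    def __init__(self, lr=0.1):
        self.lr = lr
    def step(self, x, grad):
        return x - self.lr * np.asarray(grad)
"""
Opt = strip_to_optimizer(src)
print(Opt(lr=0.5).step(np.zeros(2), np.ones(2)))  # [-0.5 -0.5]
```

The `print` in the submitted source never runs because only the `ClassDef` node survives into the compiled module body.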
## How to run

```bash
cd landscapeforge
uv sync                                  # installs deps
uv run python tests/test_episode.py      # 3 scripted episodes

# Local dev server with Gradio at /web:
uv run uvicorn landscapeforge.server.app:app --host 127.0.0.1 --port 8000
# → http://localhost:8000/web    (Gradio demo)
# → http://localhost:8000/schema (OpenEnv schema)
```

Expected test output: three `✓ PASSED` lines, final line `All tests passed.`

### Pushing to HF Spaces

Use `--no-interface` — our custom Gradio mount at `/web` replaces OpenEnv's built-in UI (the built-in passes `theme=` to `gr.mount_gradio_app`, which Gradio 5.x dropped):

```bash
openenv push . --exclude .hfignore --no-interface
```

Without `--no-interface`, `openenv push` injects `ENV ENABLE_WEB_INTERFACE=true` into the Dockerfile → OpenEnv's `create_app` routes through `create_web_interface_app` → crashes on the Gradio 5 incompatibility.

In-process usage (no server needed):

```python
from landscapeforge.models import LandscapeforgeAction
from landscapeforge.server.landscapeforge_environment import LandscapeforgeEnvironment

env = LandscapeforgeEnvironment(tier="T0", seed=42)
obs = env.reset()
obs = env.step(LandscapeforgeAction(kind="run_baseline", baseline_name="adam"))
obs = env.step(LandscapeforgeAction(kind="draft", code="...Optimizer class..."))
obs = env.step(LandscapeforgeAction(kind="commit"))
print(obs.reward, obs.r_optcoder_breakdown)
```

The HTTP server starts with `uv run uvicorn landscapeforge.server.app:app`. `/reset` and `/schema` work; `/step` currently returns 500 (see assumption 5).
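The thread-based timeout needed to unblock `/step` off the main thread could look roughly like this — a `concurrent.futures` sketch, not the project's implementation. The caveat (worth noting in any real fix) is that a timed-out worker thread is abandoned, not killed, so this bounds request latency but not CPU use:

```python
import concurrent.futures
import time

# Single worker is enough for the env's serialized step calls (illustrative).
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """SIGALRM replacement that works off the main thread (e.g. under
    uvicorn). On timeout the worker keeps running in the background —
    Python threads can't be forcibly stopped — so pair this with cheap,
    short step functions as the sandbox already assumes."""
    future = _POOL.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"call exceeded {timeout_s}s")

print(run_with_timeout(lambda: "ok", 0.5))  # ok
```

With `max_workers=1`, an abandoned slow call also delays the next submission until it finishes — a real fix would likely recycle the executor after a timeout.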
## Known gaps (tracked for next passes)

- SFT warm-start corpus: ~200 hand-authored `run_baseline → draft → inspect → draft → commit` traces (§15 Stage 0)
- GRPO training script using TRL + HF transformers
- Prompt renderer: format `obs` into the LLM prompt template from Appendix A
- Curriculum auto-advancement (rolling-mean-regret watchdog on top of `env.advance_tier`)
- Gradio demo Space with contour + trajectory animation
- Thread-based sandbox timeout to unblock HTTP `/step`
- True AST-diff-based novelty (replace difflib)
- Docker image + HF Spaces push

## Non-goals (v1)

- Free-form LandscapeForge code authoring (deferred to v2 per §18)
- Non-differentiable landscape defense (moot while LandscapeForge is a template picker)
- Multi-turn LandscapeForge-vs-OptCoder within a single episode (sequential only)
- Neural-net-as-landscape Phase D (v3)