LandscapeForge – Implementation Notes (v0.1 code)
Status: env core working end-to-end in-process; scripted tests pass.
Ships what §18 v1 of LANDSCAPEFORGE_DESIGN.md specifies as the weekend MVP,
minus training and demo layers.
What's implemented
| File | Purpose |
|---|---|
| models.py | Unified LandscapeforgeAction (discriminated by kind) + LandscapeforgeObservation |
| landscapes.py | 9 analytic template builders with hand-written gradients + TIER_MENU + structural_hints() |
| reference_optimizers.py | SGD / Momentum / Adam / crude L-BFGS + run_baseline() |
| sandbox.py | AST strip (keep only class Optimizer), safe globals, SIGALRM timeout; compile_optimizer() |
| arena.py | run_arena() for Phase-D eval + auto_test_draft() for draft-time feedback |
| rewards.py | Terminal reward (compute_optcoder_reward) + stepwise feedback (compute_step_reward) + ast_novelty_score |
| server/landscapeforge_environment.py | OpenEnv Environment subclass wiring everything together |
| server/app.py | FastAPI wrapper (scaffold, unchanged) |
| client.py | HTTP client over the unified action schema |
| tests/test_episode.py | 3 scripted episodes, all passing |
Action space (§7.1)
Four actions with differentiated budget cost:
| Action | Cost | What it returns |
|---|---|---|
| run_baseline(name) | 2 | Fixed 30-step trajectory (x_t, f_t, \|g_t\|); step count is env-controlled for comparability; source NOT revealed |
| draft(code) | 2 | Auto-test summary on 1 seed × 20 steps + compile_error if any |
| inspect(draft_idx, step_range) | 1 | Per-step detail (x, f, grad, update_norm, step_size_eff) from the referenced draft |
| commit | 0 | Terminates, triggers Phase-D arena eval |
Budget total: 12 units. Hard ceiling of 6 drafts per episode prevents brute-force enumeration.
Auto-commit contract: if the agent never calls commit, budget exhaustion auto-triggers the same Phase-D arena evaluation on state.current_draft (i.e. the most recent draft the agent submitted). Whether the agent calls commit explicitly or hits budget exhaustion, the most recent draft is always what gets evaluated. Implemented in LandscapeforgeEnvironment._finalize_episode: current_draft is evaluated; only "no draft at all" produces worst-case regret. The prompt (prompts.SYSTEM) documents this contract to the LLM so it understands it isn't penalized for not committing, but should make sure its latest draft is its best one.
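For orientation, a minimal sketch of the unified action schema and the budget accounting it implies. Field names mirror the table above and the in-process usage example at the end of these notes; the real models.py may differ in detail (it is likely Pydantic-based), and the charge helper is purely illustrative.

```python
# Sketch of the unified action schema and per-action budget costs listed above.
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

ACTION_COSTS = {"run_baseline": 2, "draft": 2, "inspect": 1, "commit": 0}
BUDGET_TOTAL = 12

@dataclass
class LandscapeforgeAction:
    kind: Literal["run_baseline", "draft", "inspect", "commit"]
    baseline_name: Optional[str] = None            # run_baseline
    code: Optional[str] = None                     # draft
    draft_idx: Optional[int] = None                # inspect
    step_range: Optional[Tuple[int, int]] = None   # inspect

def charge(budget_left: int, action: LandscapeforgeAction) -> int:
    """Deduct the action's cost; hitting 0 triggers the auto-commit contract."""
    return budget_left - ACTION_COSTS[action.kind]
```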
Reward
Two distinct signals; only the terminal one is used as the training scalar. Stepwise signals are in-context feedback for the LLM.
Terminal reward (the GRPO training scalar)
Computed once, at commit (or auto-commit on budget exhaustion), after the full Phase-D arena evaluation. Lives in obs.reward and obs.r_optcoder.
r_total =   1.0  · r_regret
          + 0.3  · r_convergence
          + 0.3  · r_robustness
          + 0.1  · r_novelty   (gated: 0 unless r_regret > 0.5)
          − 0.05 · r_budget
          − 0.5  · r_eval_failures
Range: roughly [−1.55, +1.65] in practice. Weights live in rewards.py (W_*).
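A sketch of how the scalar is assembled from the six components, using the weights and gate above; the actual combination lives in rewards.py, and the function name here is illustrative.

```python
# Illustrative combination of the six components into the terminal scalar.
W_REGRET, W_CONV, W_ROBUST = 1.0, 0.3, 0.3
W_NOVELTY, W_BUDGET, W_EVAL_FAIL = 0.1, 0.05, 0.5
NOVELTY_GATE = 0.5

def combine_terminal_reward(r_regret, r_conv, r_robust, r_novelty, r_budget, r_eval_failures):
    novelty = r_novelty if r_regret > NOVELTY_GATE else 0.0  # "novel AND broken" earns nothing
    return (W_REGRET * r_regret
            + W_CONV * r_conv
            + W_ROBUST * r_robust
            + W_NOVELTY * novelty
            - W_BUDGET * r_budget
            - W_EVAL_FAIL * r_eval_failures)
```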
1. r_regret: the main signal (Adam-relative descent, NO f_min dependency)
Measures: how much further the committed optimizer descended on f(x) than the LR-tuned Adam baseline on the same landscape, starting from the same init. Purely relative; it does not require knowing the absolute minimum.
Computed:
# Before running the Adam baseline, tune its LR per-landscape via a short
# sweep over {1e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1} on a dedicated seed
# (30 steps each). This keeps the comparison fair: the agent must beat Adam-at-
# best-LR, not Adam-at-PyTorch-default.
best_lr = tune_adam_lr(f, grad, x0=sweep_seed_init, sweep_steps=30)

my_progress   = mean over 10 arena seeds of (f_initial − f_final)
adam_progress = same, running Adam(lr=best_lr) on the same seeds
denom         = max(adam_progress, 0.01 · mean|f_initial| + 1e-6)
r_regret      = clamp( my_progress / denom − 1 , −1, +1 )
where f_initial and f_final are observable per-seed values from arena.initial_values and arena.final_values. Crashed seeds contribute 0 progress (conservative: "you didn't descend on that seed").
The denominator floor (~1% of initial f magnitude) protects against near-zero Adam progress exploding the ratio (e.g. on plateau landscapes where Adam barely moves).
Range: [−1, +1]
- +1.0: descended ≥ 2× as far as Adam (clipped ceiling)
- 0.0: matched Adam's descent exactly
- −1.0: made zero or negative progress while Adam descended normally (clipped floor)
Why this shape: Adam-relative normalization is scale-invariant; it works the same on T0 quadratics (|f| ~ 10) and Rosenbrock (|f| ~ 1000) without hand-tuned per-landscape knobs. And crucially, it does NOT require knowing f_min, so this design extends directly to neural-network training as Phase D (v3), where the global minimum of training loss is unknowable.
Bonus fields in the reward breakdown: my_progress, adam_progress, and speedup_vs_adam (= my_progress / denom) are logged alongside r_regret for diagnostics and human-readable leaderboards (e.g. "this optimizer is 10Γ faster than Adam on this landscape").
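A runnable sketch of the progress-based regret term under the definitions above; function and argument names are illustrative, and the real computation lives in rewards.py / arena.py.

```python
import numpy as np

def progress_regret(my_finals, adam_finals, initial_values):
    """Adam-relative descent reward, per the formula above (illustrative names)."""
    initial_values = np.asarray(initial_values, dtype=float)
    my_progress = float(np.mean(initial_values - np.asarray(my_finals, dtype=float)))
    adam_progress = float(np.mean(initial_values - np.asarray(adam_finals, dtype=float)))
    # Floor the denominator at ~1% of the initial-f magnitude so near-zero Adam
    # progress (plateau landscapes) cannot explode the ratio.
    denom = max(adam_progress, 0.01 * float(np.mean(np.abs(initial_values))) + 1e-6)
    return float(np.clip(my_progress / denom - 1.0, -1.0, 1.0))

# Descending twice as far as Adam saturates the +1 ceiling:
# progress_regret(my_finals=[0.0], adam_finals=[5.0], initial_values=[10.0]) == 1.0
```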
2. r_convergence: speed bonus
Measures: how quickly the committed optimizer drops f below 1% of the initial value on the first arena seed.
Computed:
r_conv = clamp( 1 − convergence_step / N , 0, 1 )   if converged
       = 0.0                                        if never reached 1%

where N = ARENA_STEPS = 200 and convergence_step is the first t such that f(x_t) < 0.01 · f(x_0) on seed 101 (the first seed in ARENA_SEEDS).
Range: [0, 1]
- 1.0: converged at step 0 (impossible; asymptotic)
- 0.5: converged at step 100
- 0.0: never converged within 200 steps
Why: distinguishes fast optimizers from slow ones among those that do converge. Without this, an optimizer that reaches the minimum in 50 steps and one that reaches it in 199 get identical r_regret.
Only uses seed 101: one seed's trajectory is enough to proxy speed; averaging across all 10 would be more faithful, but this is cheaper.
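A small sketch of how the convergence step could be read off a single-seed trajectory; illustrative only, and it assumes positive f values as in the 1%-of-initial definition above.

```python
import numpy as np

def convergence_bonus(f_trajectory, n_steps=200, frac=0.01):
    """Speed bonus from the first arena seed's f-trajectory (illustrative sketch)."""
    f = np.asarray(f_trajectory, dtype=float)
    below = np.nonzero(f < frac * f[0])[0]   # steps where f dropped below 1% of f(x_0)
    if below.size == 0:
        return 0.0                           # never converged within the horizon
    return float(np.clip(1.0 - int(below[0]) / n_steps, 0.0, 1.0))
```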
3. r_robustness: cross-seed consistency
Measures: whether the optimizer achieves similar final values across all 10 arena seeds, or whether it's luck-of-the-init sensitive.
Computed in ArenaResult.robustness:
r_robust = clamp( 1 − std(final_values) / |mean(final_values)| , 0, 1 )
using only seeds that didn't crash. If mean ≈ 0, returns 1.0 when std is tiny, else 0.0.
Range: [0, 1]
- 1.0: all 10 seeds ended at essentially the same f value (tight distribution)
- 0.0: huge variance across seeds (works on some inits, fails on others)
Why: anti-"works-only-from-favorable-init" defense (§11). An optimizer that converges cleanly on seed 101 but diverges on seed 505 has low r_robustness even if mean_regret is okay.
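A sketch of the robustness formula over the non-crashed seeds; illustrative only, the real version lives in ArenaResult.robustness.

```python
import numpy as np

def robustness(final_values, eps=1e-12):
    """Cross-seed consistency of final f values, per the formula above (sketch)."""
    finals = np.asarray(final_values, dtype=float)   # non-crashed seeds only
    mean_abs = abs(float(finals.mean()))
    if mean_abs < eps:                               # mean ≈ 0: only a tight spread scores
        return 1.0 if float(finals.std()) < eps else 0.0
    return float(np.clip(1.0 - float(finals.std()) / mean_abs, 0.0, 1.0))
```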
4. r_novelty: structural departure from references
Measures: how structurally different the committed source is from the standard optimizers SGD/Adam/Momentum.
Computed via ast_novelty_score:
r_novelty = min over ref in {sgd, adam, momentum} of
            1 − difflib.SequenceMatcher(None, committed, ref).ratio()
clamped to [0, 1]
Uses character-level diff ratio (difflib). 0.0 = byte-identical to one of the references. 1.0 = totally different strings. For reference, a tweaked-Adam with different hparams scores ~0.3; a genuinely different algorithm (line-search + trust region) scores ~0.7.
Range: [0, 1]
Gate: r_novelty is only applied when r_regret > 0.5. This prevents rewarding "novel AND broken" β you must beat Adam by a clear margin before creativity earns anything.
Why: prevents the model from just copying Adam after running run_baseline("adam"). Without this gate or term, the reward-maximizing strategy is "memorize Adam."
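A sketch of the character-level novelty score; the reference sources here are placeholders, whereas the env uses the actual SGD / Adam / Momentum implementations from reference_optimizers.py.

```python
import difflib

# Placeholder reference sources; the env uses the real SGD / Adam / Momentum code.
REFERENCE_SOURCES = {"sgd": "class Optimizer: ...",
                     "adam": "class Optimizer: ...",
                     "momentum": "class Optimizer: ..."}

def ast_novelty_score(committed_src: str) -> float:
    """1 minus the highest character-level similarity to any reference (sketch)."""
    best_similarity = max(
        difflib.SequenceMatcher(None, committed_src, ref).ratio()
        for ref in REFERENCE_SOURCES.values()
    )
    return min(max(1.0 - best_similarity, 0.0), 1.0)
```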
5. r_budget: penalty for budget spent
Measures: what fraction of the action budget was used.
Computed:
r_budget = clamp( budget_spent / 12 , 0, 1 )
where budget_spent is the sum of per-action costs (baseline=2, draft=2, inspect=1, commit=0), NOT the count of actions.
Range: [0, 1]
- 0.0: no budget used (impossible, since at least one draft is needed)
- 1.0: full budget consumed
Why: mild pressure toward efficiency. With W_BUDGET = 0.05, the swing between "committed at budget 4" and "exhausted at 12" is only (1 − 4/12) × 0.05 ≈ 0.033 reward: deliberately small so it doesn't override algorithmic quality, but large enough to discourage deliberate stalling.
6. r_eval_failures: crash penalty
Measures: fraction of arena seeds where the committed optimizer raised a SandboxError (NaN output, wrong shape, timeout, Python error).
Computed:
r_eval_failures = sum(arena.crashed) / 10
Range: [0, 1]
- 0.0: all 10 seeds ran to completion
- 1.0: committed code crashes on every seed
Why: heavily weighted at W_EVAL_FAIL = 0.5, so a uniformly-crashing commit loses at least 0.5 from this term alone, regardless of any other component. Prevents "commit broken code to avoid bad eval" gaming (§11).
Concrete example
From the scripted test (quadratic dim=5, cond=44.4, SGD+momentum with lr=0.05 committed, progress-based reward, LR-tuned Adam baseline):
| Component | Value | Weighted contribution |
|---|---|---|
| r_regret | ~0 (my_progress ≈ adam_progress, both ≈ 10.51) | 0.000 |
| r_convergence | +0.835 | +0.251 |
| r_robustness | +0.5897 | +0.177 |
| r_novelty | 0 (gated; r_regret < 0.5) | 0.000 |
| r_budget | 0.583 (7/12 used) | −0.029 |
| r_eval_failures | 0.0 (no crashes) | 0.0 |
| r_total | – | +0.398 |
Adam's tuned LR for this landscape came out to 0.03. Once the LR is tuned, the committed SGD+momentum (lr=0.05) is essentially tied with Adam, and the reward correctly reflects that. Under the previous unfair baseline (Adam at its default lr=1e-3), the same draft would have scored 1.42. The ~1.0 reward swing reflects how much of the old "win" was just LR tuning, not algorithmic merit.
Stepwise feedback (NOT training reward)
Computed in compute_step_reward. Surfaced to the LLM via obs.last_action_result["feedback"] after each non-terminal turn. Explicitly NOT summed into r_total.
- phi_delta: change in −best_auto_test_final_f / 10 across this turn. Positive means the newest draft improved the best auto-test result. The LLM sees this and knows "I just made progress."
- compile_penalty: literal −0.1 marker emitted whenever the latest draft failed to compile. Purely a flag for the LLM's context.
These are communication channels, not reward. Keeping them out of the training scalar preserves the terminal-only robustness property while still giving the LLM something to react to mid-episode.
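A sketch of the shape of this feedback, assuming the potential is −best_auto_test_final_f / PHI_SCALE; the exact payload produced by compute_step_reward may differ.

```python
PHI_SCALE = 10.0
COMPILE_PENALTY_SIGNAL = -0.1

def step_feedback(best_final_before, best_final_after, compiled_ok):
    """In-context feedback only; never summed into the terminal training scalar."""
    phi_before = -best_final_before / PHI_SCALE
    phi_after = -best_final_after / PHI_SCALE
    feedback = {"phi_delta": phi_after - phi_before}   # > 0 means the best draft improved
    if not compiled_ok:
        feedback["compile_penalty"] = COMPILE_PENALTY_SIGNAL
    return feedback
```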
Constants
| Constant | Value | Source |
|---|---|---|
| BUDGET_TOTAL | 12 | server/landscapeforge_environment.py |
| ACTION_COSTS | baseline=2, draft=2, inspect=1, commit=0 | models.py |
| ARENA_SEEDS | [101, 202, ..., 1010] (10 fresh seeds) | server/landscapeforge_environment.py |
| ARENA_STEPS | 200 | same |
| BASELINE_STEPS | 30 | same (fixed; agent cannot override) |
| Adam LR sweep grid | {1e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1} | reference_optimizers.tune_adam_lr |
| Adam LR sweep steps | 30 | same |
| Adam LR sweep init seed | 0 (not in ARENA_SEEDS) | _ensure_adam_arena in env |
| Draft auto-test init seed | 0 | arena.auto_test_draft |
| Draft auto-test steps | 20 | same |
| Init scale (seed-sampled x0) | N(0, 0.5² I) | arena.run_arena + auto_test_draft |
| Dim range per episode | 2–5 (v1) | server/landscapeforge_environment.py |
| Sandbox init timeout | 1.0 s | sandbox.compile_optimizer |
| Sandbox step timeout | 0.5 s | same |
| Reward weights | w_regret=1.0, w_conv=0.3, w_robust=0.3, w_novelty=0.1, w_budget=0.05, w_evalfail=0.5 | rewards.py |
| Novelty gate | Applied only when r_regret > 0.5 | rewards.NOVELTY_GATE |
| PHI_SCALE (potential normalizer) | 10.0 | rewards.py |
| COMPILE_PENALTY_SIGNAL | -0.1 | rewards.py |
| Tier menus | T0: quadratic/styblinski_tang/huber · T1: +gaussian_mix/himmelblau · T2: +rosenbrock/stiff_quadratic/plateau/cliff | landscapes.TIER_MENU |
| Quadratic cond-number cap per tier | T0: 100, T1: 1000, T2: 10000 | _sample_params in env |
Assumptions / simplifications (v1)
1. LandscapeForge is a template picker, not a free-form code author. The env internally samples (template, params) uniformly from the active tier's menu; there is no LandscapeForge LLM adapter in v1. Defers §18 v2 non-differentiability and gradient-source risks.
2. All gradients are analytic, hand-written per template. No autodiff/JAX, no finite differences. Templates are verified differentiable by construction.
2b. Reward does NOT depend on f_min. v0.2 switched from r_regret = 1 − (my_regret / adam_regret) (which required knowing the global minimum) to a progress-based form: r_regret = clamp(my_progress / adam_progress − 1, −1, +1), where progress = f_initial − f_final is observable per seed. Landscape.f_min is retained only for diagnostics, NOT used in training. This unlocks the v3 NN extension (training loss has no knowable minimum).
3. Only OptCoder has a policy. The OpenEnv Environment here exposes the OptCoder side; LandscapeForge selection is internal.
4. Single backbone assumption (Qwen2.5-3B base + OptCoder LoRA) is in the design but not in code; the training script is not yet implemented.
5. Sandbox is in-process + SIGALRM timeout. Works on the main thread / CPython / POSIX. Known bug: HTTP /step via uvicorn returns 500 because SIGALRM only fires on the main thread; a thread-based timeout fix is TODO. (A minimal sketch of the timeout mechanism follows this list.)
6. AST strip drops all module-level code except class Optimizer. Imports are also dropped; the sandbox pre-injects np and numpy into globals, so submitted code can use np.* without an import line.
7. Dim range 2–5 for v1 even though the design allows up to 100. Keeps arena eval fast (~30 ms/episode) and keeps the prompt token budget tight.
8. Adam baseline for reward normalization is run inside the env on every commit to compute baseline_adam_regret. Cost: one 200-step × 10-seed arena run per episode on top of the OptCoder eval. ~30 ms, acceptable.
9. AST novelty score uses difflib (character-level, Levenshtein-ish) rather than a true AST diff. Enough to detect "commit ≈ reference" but not semantically rigorous. Upgrade path noted.
10. Tier advancement is not auto-wired. env.advance_tier(new_tier) exists as a manual API; rolling-regret-based auto-advance is a trainer-side concern and not yet implemented.
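As referenced in assumption 5, a minimal sketch of a SIGALRM-style timeout; this is not the project's sandbox.py, it only illustrates why the mechanism is main-thread/POSIX-bound.

```python
import signal
from contextlib import contextmanager

@contextmanager
def alarm_timeout(seconds: float):
    """POSIX, main-thread-only timeout in the spirit of assumption 5 (sketch).
    SIGALRM handlers only run on the main thread, which is why the HTTP /step
    path under uvicorn worker threads currently fails."""
    def _raise(signum, frame):
        raise TimeoutError(f"sandbox call exceeded {seconds}s")
    previous = signal.signal(signal.SIGALRM, _raise)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0.0)   # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)
```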
How to run
cd landscapeforge
uv sync                               # installs deps
uv run python tests/test_episode.py   # 3 scripted episodes
# Local dev server with Gradio at /web:
uv run uvicorn landscapeforge.server.app:app --host 127.0.0.1 --port 8000
#   → http://localhost:8000/web    (Gradio demo)
#   → http://localhost:8000/schema (OpenEnv schema)
Pushing to HF Spaces
Use --no-interface: our custom Gradio mount at /web replaces OpenEnv's
built-in UI (the built-in passes theme= to gr.mount_gradio_app, which
Gradio 5.x dropped):
openenv push . --exclude .hfignore --no-interface
Without --no-interface, openenv push injects ENV ENABLE_WEB_INTERFACE=true
into the Dockerfile → OpenEnv's create_app routes through
create_web_interface_app → crashes on Gradio 5 incompatibility.
Expected output of tests/test_episode.py: three ✓ PASSED lines, final line All tests passed.
In-process usage (no server needed):
from landscapeforge.models import LandscapeforgeAction
from landscapeforge.server.landscapeforge_environment import LandscapeforgeEnvironment
env = LandscapeforgeEnvironment(tier="T0", seed=42)
obs = env.reset()
obs = env.step(LandscapeforgeAction(kind="run_baseline", baseline_name="adam"))
obs = env.step(LandscapeforgeAction(kind="draft", code="...Optimizer class..."))
obs = env.step(LandscapeforgeAction(kind="commit"))
print(obs.reward, obs.r_optcoder_breakdown)
HTTP server starts with uv run uvicorn landscapeforge.server.app:app. /reset and /schema work; /step currently returns 500 (see assumption 5).
Known gaps (tracked for next passes)
- SFT warm-start corpus: ~200 hand-authored run_baseline → draft → inspect → draft → commit traces (§15 Stage 0)
- GRPO training script using TRL + HF transformers
- Prompt renderer: format obs into the LLM prompt template from Appendix A
- Curriculum auto-advancement (rolling-mean-regret watchdog on top of env.advance_tier)
- Gradio demo Space with contour + trajectory animation
- Thread-based sandbox timeout to unblock HTTP /step
- True AST-diff-based novelty (replace difflib)
- Docker image + HF Spaces push
Non-goals (v1)
- Free-form LandscapeForge code authoring (deferred to v2 per §18)
- Non-differentiable landscape defense (moot while LandscapeForge is template-picker)
- Multi-turn LandscapeForge-vs-OptCoder within a single episode (sequential only)
- Neural-net-as-landscape Phase-D (v3)