LandscapeForge – Implementation Notes (v0.1 code)
Status: env core working end-to-end in-process; scripted tests pass.
Ships what §18 v1 of LANDSCAPEFORGE_DESIGN.md specifies as the weekend MVP,
minus training and demo layers.
What's implemented
| File | Purpose |
|---|---|
| models.py | Unified LandscapeforgeAction (discriminated by kind) + LandscapeforgeObservation |
| landscapes.py | 9 analytic template builders with hand-written gradients + TIER_MENU + structural_hints() |
| reference_optimizers.py | SGD / Momentum / Adam / crude L-BFGS + run_baseline() |
| sandbox.py | AST strip (keep only class Optimizer), safe globals, SIGALRM timeout; compile_optimizer() |
| arena.py | run_arena() for Phase-D eval + auto_test_draft() for draft-time feedback |
| rewards.py | Terminal reward (compute_optcoder_reward) + stepwise feedback (compute_step_reward) + ast_novelty_score |
| server/landscapeforge_environment.py | OpenEnv Environment subclass wiring everything together |
| server/app.py | FastAPI wrapper (scaffold, unchanged) |
| client.py | HTTP client over the unified action schema |
| tests/test_episode.py | 3 scripted episodes, all passing |
Action space (§7.1)
Four actions with differentiated budget cost:
| Action | Cost | What it returns |
|---|---|---|
| run_baseline(name) | 2 | Fixed 30-step trajectory (x_t, f_t, \|g_t\|); step count is env-controlled for comparability; source NOT revealed |
| draft(code) | 2 | Auto-test summary on 1 seed × 20 steps + compile_error if any |
| inspect(draft_idx, step_range) | 1 | Per-step detail (x, f, grad, update_norm, step_size_eff) from the referenced draft |
| commit | 0 | Terminates, triggers Phase-D arena eval |
Budget total: 12 units. Hard ceiling of 6 drafts per episode prevents brute-force enumeration.
Auto-commit contract: if the agent never calls commit, budget exhaustion auto-triggers the same Phase-D arena evaluation on state.current_draft (i.e. the most recent draft the agent submitted). Whether the agent calls commit explicitly or hits budget exhaustion, the most recent draft is always what gets evaluated. Implemented in LandscapeforgeEnvironment._finalize_episode: current_draft is evaluated; only "no draft at all" produces worst-case regret. The prompt (prompts.SYSTEM) documents this contract to the LLM so it understands it isn't penalized for not committing, but should make sure its latest draft is its best one.
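For orientation, a minimal sketch of the unified action schema and the budget accounting it implies. Field names mirror the table above and the in-process usage example at the end of these notes; the real models.py may differ in detail (it is likely Pydantic-based), and the charge helper is purely illustrative.

```python
# Sketch of the unified action schema and per-action budget costs listed above.
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

ACTION_COSTS = {"run_baseline": 2, "draft": 2, "inspect": 1, "commit": 0}
BUDGET_TOTAL = 12

@dataclass
class LandscapeforgeAction:
    kind: Literal["run_baseline", "draft", "inspect", "commit"]
    baseline_name: Optional[str] = None            # run_baseline
    code: Optional[str] = None                     # draft
    draft_idx: Optional[int] = None                # inspect
    step_range: Optional[Tuple[int, int]] = None   # inspect

def charge(budget_left: int, action: LandscapeforgeAction) -> int:
    """Deduct the action's cost; hitting 0 triggers the auto-commit contract."""
    return budget_left - ACTION_COSTS[action.kind]
```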
Reward
Two distinct signals; only the terminal one is used as the training scalar. Stepwise signals are in-context feedback for the LLM.
Terminal reward (the GRPO training scalar)
Computed once, at commit (or auto-commit on budget exhaustion), after the full Phase-D arena evaluation. Lives in obs.reward and obs.r_optcoder.
r_total =   1.0  · r_regret
          + 0.3  · r_convergence
          + 0.3  · r_robustness
          + 0.1  · r_novelty   (gated: 0 unless r_regret > 0.5)
          − 0.05 · r_budget
          − 0.5  · r_eval_failures
Range: roughly [−1.55, +1.65] in practice. Weights live in rewards.py (W_*).
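A sketch of how the scalar is assembled from the six components, using the weights and gate above; the actual combination lives in rewards.py, and the function name here is illustrative.

```python
# Illustrative combination of the six components into the terminal scalar.
W_REGRET, W_CONV, W_ROBUST = 1.0, 0.3, 0.3
W_NOVELTY, W_BUDGET, W_EVAL_FAIL = 0.1, 0.05, 0.5
NOVELTY_GATE = 0.5

def combine_terminal_reward(r_regret, r_conv, r_robust, r_novelty, r_budget, r_eval_failures):
    novelty = r_novelty if r_regret > NOVELTY_GATE else 0.0  # "novel AND broken" earns nothing
    return (W_REGRET * r_regret
            + W_CONV * r_conv
            + W_ROBUST * r_robust
            + W_NOVELTY * novelty
            - W_BUDGET * r_budget
            - W_EVAL_FAIL * r_eval_failures)
```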
1. r_regret: the main signal (Adam-relative descent, NO f_min dependency)
Measures: how much further the committed optimizer descended on f(x) than the LR-tuned Adam baseline on the same landscape, starting from the same init. Purely relative; it does not require knowing the absolute minimum.
Computed:
# Before running the Adam baseline, tune its LR per-landscape via a short
# sweep over {1e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1} on a dedicated seed
# (30 steps each). This keeps the comparison fair: the agent must beat Adam-at-
# best-LR, not Adam-at-PyTorch-default.
best_lr = tune_adam_lr(f, grad, x0=sweep_seed_init, sweep_steps=30)

my_progress   = mean over 10 arena seeds of (f_initial − f_final)
adam_progress = same, running Adam(lr=best_lr) on the same seeds
denom         = max(adam_progress, 0.01 · mean|f_initial| + 1e-6)
r_regret      = clamp( my_progress / denom − 1 , −1, +1 )
where f_initial and f_final are observable per-seed values from arena.initial_values and arena.final_values. Crashed seeds contribute 0 progress (conservative: "you didn't descend on that seed").
The denominator floor (~1% of initial f magnitude) protects against near-zero Adam progress exploding the ratio (e.g. on plateau landscapes where Adam barely moves).
Range: [−1, +1]
- +1.0: descended ≥ 2× as far as Adam (clipped ceiling)
- 0.0: matched Adam's descent exactly
- −1.0: made zero or negative progress while Adam descended normally (clipped floor)
Why this shape: Adam-relative normalization is scale-invariant; it works the same on T0 quadratics (|f| ~ 10) and Rosenbrock (|f| ~ 1000) without hand-tuned per-landscape knobs. And crucially, it does NOT require knowing f_min, so this design extends directly to neural-network training as Phase D (v3), where the global minimum of training loss is unknowable.
Bonus fields in the reward breakdown: my_progress, adam_progress, and speedup_vs_adam (= my_progress / denom) are logged alongside r_regret for diagnostics and human-readable leaderboards (e.g. "this optimizer is 10Γ faster than Adam on this landscape").
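A runnable sketch of the progress-based regret term under the definitions above; function and argument names are illustrative, and the real computation lives in rewards.py / arena.py.

```python
import numpy as np

def progress_regret(my_finals, adam_finals, initial_values):
    """Adam-relative descent reward, per the formula above (illustrative names)."""
    initial_values = np.asarray(initial_values, dtype=float)
    my_progress = float(np.mean(initial_values - np.asarray(my_finals, dtype=float)))
    adam_progress = float(np.mean(initial_values - np.asarray(adam_finals, dtype=float)))
    # Floor the denominator at ~1% of the initial-f magnitude so near-zero Adam
    # progress (plateau landscapes) cannot explode the ratio.
    denom = max(adam_progress, 0.01 * float(np.mean(np.abs(initial_values))) + 1e-6)
    return float(np.clip(my_progress / denom - 1.0, -1.0, 1.0))

# Descending twice as far as Adam saturates the +1 ceiling:
# progress_regret(my_finals=[0.0], adam_finals=[5.0], initial_values=[10.0]) == 1.0
```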
2. r_convergence: speed bonus
Measures: how quickly the committed optimizer drops f below 1% of the initial value on the first arena seed.
Computed:
r_conv = clamp( 1 − convergence_step / N , 0, 1 )   if converged
       = 0.0                                        if never reached 1%

where N = ARENA_STEPS = 200 and convergence_step is the first t such that f(x_t) < 0.01 · f(x_0) on seed 101 (the first seed in ARENA_SEEDS).
Range: [0, 1]
- 1.0: converged at step 0 (impossible; asymptotic)
- 0.5: converged at step 100
- 0.0: never converged within 200 steps
Why: distinguishes fast optimizers from slow ones among those that do converge. Without this, an optimizer that reaches the minimum in 50 steps and one that reaches it in 199 get identical r_regret.
Only uses seed 101: one seed's trajectory is enough to proxy speed; averaging across all 10 would be more faithful, but this is cheaper.
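A small sketch of how the convergence step could be read off a single-seed trajectory; illustrative only, and it assumes positive f values as in the 1%-of-initial definition above.

```python
import numpy as np

def convergence_bonus(f_trajectory, n_steps=200, frac=0.01):
    """Speed bonus from the first arena seed's f-trajectory (illustrative sketch)."""
    f = np.asarray(f_trajectory, dtype=float)
    below = np.nonzero(f < frac * f[0])[0]   # steps where f dropped below 1% of f(x_0)
    if below.size == 0:
        return 0.0                           # never converged within the horizon
    return float(np.clip(1.0 - int(below[0]) / n_steps, 0.0, 1.0))
```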
3. r_robustness: cross-seed consistency
Measures: whether the optimizer achieves similar final values across all 10 arena seeds, or whether it's luck-of-the-init sensitive.
Computed in ArenaResult.robustness:
r_robust = clamp( 1 − std(final_values) / |mean(final_values)| , 0, 1 )
using only seeds that didn't crash. If mean ≈ 0, returns 1.0 when std is tiny, else 0.0.
Range: [0, 1]
- 1.0: all 10 seeds ended at essentially the same f value (tight distribution)
- 0.0: huge variance across seeds (works on some inits, fails on others)
Why: anti-"works-only-from-favorable-init" defense (§11). An optimizer that converges cleanly on seed 101 but diverges on seed 505 has low r_robustness even if mean_regret is okay.
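A sketch of the robustness formula over the non-crashed seeds; illustrative only, the real version lives in ArenaResult.robustness.

```python
import numpy as np

def robustness(final_values, eps=1e-12):
    """Cross-seed consistency of final f values, per the formula above (sketch)."""
    finals = np.asarray(final_values, dtype=float)   # non-crashed seeds only
    mean_abs = abs(float(finals.mean()))
    if mean_abs < eps:                               # mean ≈ 0: only a tight spread scores
        return 1.0 if float(finals.std()) < eps else 0.0
    return float(np.clip(1.0 - float(finals.std()) / mean_abs, 0.0, 1.0))
```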
4. r_novelty: structural departure from references
Measures: how structurally different the committed source is from the standard optimizers SGD/Adam/Momentum.
Computed via ast_novelty_score:
r_novelty = min over ref in {sgd, adam, momentum} of
            1 − difflib.SequenceMatcher(None, committed, ref).ratio()
clamped to [0, 1]
Uses character-level diff ratio (difflib). 0.0 = byte-identical to one of the references. 1.0 = totally different strings. For reference, a tweaked-Adam with different hparams scores ~0.3; a genuinely different algorithm (line-search + trust region) scores ~0.7.
Range: [0, 1]
Gate: r_novelty is only applied when r_regret > 0.5. This prevents rewarding "novel AND broken" β you must beat Adam by a clear margin before creativity earns anything.
Why: prevents the model from just copying Adam after running run_baseline("adam"). Without this gate or term, the reward-maximizing strategy is "memorize Adam."
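A sketch of the character-level novelty score; the reference sources here are placeholders, whereas the env uses the actual SGD / Adam / Momentum implementations from reference_optimizers.py.

```python
import difflib

# Placeholder reference sources; the env uses the real SGD / Adam / Momentum code.
REFERENCE_SOURCES = {"sgd": "class Optimizer: ...",
                     "adam": "class Optimizer: ...",
                     "momentum": "class Optimizer: ..."}

def ast_novelty_score(committed_src: str) -> float:
    """1 minus the highest character-level similarity to any reference (sketch)."""
    best_similarity = max(
        difflib.SequenceMatcher(None, committed_src, ref).ratio()
        for ref in REFERENCE_SOURCES.values()
    )
    return min(max(1.0 - best_similarity, 0.0), 1.0)
```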
5. r_budget: penalty for budget spent
Measures: what fraction of the action budget was used.
Computed:
r_budget = clamp( budget_spent / 12 , 0, 1 )
where budget_spent is the sum of per-action costs (baseline=2, draft=2, inspect=1, commit=0), NOT the count of actions.
Range: [0, 1]
- 0.0: no budget used (impossible, since at least one draft is needed)
- 1.0: full budget consumed
Why: mild pressure toward efficiency. With W_BUDGET = 0.05, the swing between "committed at budget 4" and "exhausted at 12" is only (1 − 4/12) × 0.05 ≈ 0.033 reward: deliberately small so it doesn't override algorithmic quality, but large enough to discourage deliberate stalling.
6. r_eval_failures: crash penalty
Measures: fraction of arena seeds where the committed optimizer raised a SandboxError (NaN output, wrong shape, timeout, Python error).
Computed:
r_eval_failures = sum(arena.crashed) / 10
Range: [0, 1]
- 0.0: all 10 seeds ran to completion
- 1.0: committed code crashes on every seed
Why: heavily weighted at W_EVAL_FAIL = 0.5, so a uniformly-crashing commit loses at least 0.5 from this term alone, regardless of any other component. Prevents "commit broken code to avoid bad eval" gaming (§11).
Concrete example
From the scripted test (quadratic dim=5, cond=44.4, SGD+momentum with lr=0.05 committed, progress-based reward, LR-tuned Adam baseline):
| Component | Value | Weighted contribution |
|---|---|---|
| r_regret | ~0 (my_progress ≈ adam_progress, both ≈ 10.51) | 0.000 |
| r_convergence | +0.835 | +0.251 |
| r_robustness | +0.5897 | +0.177 |
| r_novelty | 0 (gated; r_regret < 0.5) | 0.000 |
| r_budget | 0.583 (7/12 used) | −0.029 |
| r_eval_failures | 0.0 (no crashes) | 0.0 |
| r_total | – | +0.398 |
Adam's tuned LR for this landscape came out to 0.03. Once the LR is tuned, the committed SGD+momentum (lr=0.05) is essentially tied with Adam, and the reward correctly reflects that. Under the previous unfair baseline (Adam at its default lr=1e-3), the same draft would have scored 1.42. The ~1.0 reward swing reflects how much of the old "win" was just LR tuning, not algorithmic merit.
Stepwise feedback (NOT training reward)
Computed in compute_step_reward. Surfaced to the LLM via obs.last_action_result["feedback"] after each non-terminal turn. Explicitly NOT summed into r_total.
- phi_delta: change in −best_auto_test_final_f / 10 across this turn. Positive means the newest draft improved the best auto-test result. The LLM sees this and knows "I just made progress."
- compile_penalty: literal −0.1 marker emitted whenever the latest draft failed to compile. Purely a flag for the LLM's context.
These are communication channels, not reward. Keeping them out of the training scalar preserves the terminal-only robustness property while still giving the LLM something to react to mid-episode.
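A sketch of the shape of this feedback, assuming the potential is −best_auto_test_final_f / PHI_SCALE; the exact payload produced by compute_step_reward may differ.

```python
PHI_SCALE = 10.0
COMPILE_PENALTY_SIGNAL = -0.1

def step_feedback(best_final_before, best_final_after, compiled_ok):
    """In-context feedback only; never summed into the terminal training scalar."""
    phi_before = -best_final_before / PHI_SCALE
    phi_after = -best_final_after / PHI_SCALE
    feedback = {"phi_delta": phi_after - phi_before}   # > 0 means the best draft improved
    if not compiled_ok:
        feedback["compile_penalty"] = COMPILE_PENALTY_SIGNAL
    return feedback
```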
Constants
| Constant | Value | Source |
|---|---|---|
| BUDGET_TOTAL | 12 | server/landscapeforge_environment.py |
| ACTION_COSTS | baseline=2, draft=2, inspect=1, commit=0 | models.py |
| ARENA_SEEDS | [101, 202, ..., 1010] (10 fresh seeds) | server/landscapeforge_environment.py |
| ARENA_STEPS | 200 | same |
| BASELINE_STEPS | 30 | same (fixed; agent cannot override) |
| Adam LR sweep grid | {1e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1, 3e-1} | reference_optimizers.tune_adam_lr |
| Adam LR sweep steps | 30 | same |
| Adam LR sweep init seed | 0 (not in ARENA_SEEDS) | _ensure_adam_arena in env |
| Draft auto-test init seed | 0 | arena.auto_test_draft |
| Draft auto-test steps | 20 | same |
| Init scale (seed-sampled x0) | N(0, 0.5² I) | arena.run_arena + auto_test_draft |
| Dim range per episode | 2–5 (v1) | server/landscapeforge_environment.py |
| Sandbox init timeout | 1.0 s | sandbox.compile_optimizer |
| Sandbox step timeout | 0.5 s | same |
| Reward weights | w_regret=1.0, w_conv=0.3, w_robust=0.3, w_novelty=0.1, w_budget=0.05, w_evalfail=0.5 | rewards.py |
| Novelty gate | Applied only when r_regret > 0.5 | rewards.NOVELTY_GATE |
| PHI_SCALE (potential normalizer) | 10.0 | rewards.py |
| COMPILE_PENALTY_SIGNAL | -0.1 | rewards.py |
| Tier menus | T0: quadratic/styblinski_tang/huber · T1: +gaussian_mix/himmelblau · T2: +rosenbrock/stiff_quadratic/plateau/cliff | landscapes.TIER_MENU |
| Quadratic cond-number cap per tier | T0: 100, T1: 1000, T2: 10000 | _sample_params in env |
Assumptions / simplifications (v1)
1. LandscapeForge is a template picker, not a free-form code author. The env internally samples (template, params) uniformly from the active tier's menu; there is no LandscapeForge LLM adapter in v1. Defers §18 v2 non-differentiability and gradient-source risks.
2. All gradients are analytic, hand-written per template. No autodiff/JAX, no finite differences. Templates are verified differentiable by construction.
2b. Reward does NOT depend on f_min. v0.2 switched from r_regret = 1 − (my_regret / adam_regret) (which required knowing the global minimum) to a progress-based form: r_regret = clamp(my_progress / adam_progress − 1, −1, +1), where progress = f_initial − f_final is observable per seed. Landscape.f_min is retained only for diagnostics, NOT used in training. This unlocks the v3 NN extension (training loss has no knowable minimum).
3. Only OptCoder has a policy. The OpenEnv Environment here exposes the OptCoder side; LandscapeForge selection is internal.
4. Single backbone assumption (Qwen2.5-3B base + OptCoder LoRA) is in the design but not in code; the training script is not yet implemented.
5. Sandbox is in-process + SIGALRM timeout. Works on the main thread / CPython / POSIX. Known bug: HTTP /step via uvicorn returns 500 because SIGALRM only fires on the main thread; a thread-based timeout fix is TODO. (A minimal sketch of the timeout mechanism follows this list.)
6. AST strip drops all module-level code except class Optimizer. Imports are also dropped; the sandbox pre-injects np and numpy into globals, so submitted code can use np.* without an import line.
7. Dim range 2–5 for v1 even though the design allows up to 100. Keeps arena eval fast (~30 ms/episode) and keeps the prompt token budget tight.
8. Adam baseline for reward normalization is run inside the env on every commit to compute baseline_adam_regret. Cost: one 200-step × 10-seed arena run per episode on top of the OptCoder eval. ~30 ms, acceptable.
9. AST novelty score uses difflib (character-level, Levenshtein-ish) rather than a true AST diff. Enough to detect "commit ≈ reference" but not semantically rigorous. Upgrade path noted.
10. Tier advancement is not auto-wired. env.advance_tier(new_tier) exists as a manual API; rolling-regret-based auto-advance is a trainer-side concern and not yet implemented.
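As referenced in assumption 5, a minimal sketch of a SIGALRM-style timeout; this is not the project's sandbox.py, it only illustrates why the mechanism is main-thread/POSIX-bound.

```python
import signal
from contextlib import contextmanager

@contextmanager
def alarm_timeout(seconds: float):
    """POSIX, main-thread-only timeout in the spirit of assumption 5 (sketch).
    SIGALRM handlers only run on the main thread, which is why the HTTP /step
    path under uvicorn worker threads currently fails."""
    def _raise(signum, frame):
        raise TimeoutError(f"sandbox call exceeded {seconds}s")
    previous = signal.signal(signal.SIGALRM, _raise)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0.0)   # cancel the pending alarm
        signal.signal(signal.SIGALRM, previous)
```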
How to run
cd landscapeforge
uv sync                               # installs deps
uv run python tests/test_episode.py   # 3 scripted episodes
# Local dev server with Gradio at /web:
uv run uvicorn landscapeforge.server.app:app --host 127.0.0.1 --port 8000
#   → http://localhost:8000/web    (Gradio demo)
#   → http://localhost:8000/schema (OpenEnv schema)
Pushing to HF Spaces
Use --no-interface: our custom Gradio mount at /web replaces OpenEnv's
built-in UI (the built-in passes theme= to gr.mount_gradio_app, which
Gradio 5.x dropped):
openenv push . --exclude .hfignore --no-interface
Without --no-interface, openenv push injects ENV ENABLE_WEB_INTERFACE=true
into the Dockerfile → OpenEnv's create_app routes through
create_web_interface_app → crashes on Gradio 5 incompatibility.
Expected output of tests/test_episode.py: three ✓ PASSED lines, final line All tests passed.
In-process usage (no server needed):
from landscapeforge.models import LandscapeforgeAction
from landscapeforge.server.landscapeforge_environment import LandscapeforgeEnvironment
env = LandscapeforgeEnvironment(tier="T0", seed=42)
obs = env.reset()
obs = env.step(LandscapeforgeAction(kind="run_baseline", baseline_name="adam"))
obs = env.step(LandscapeforgeAction(kind="draft", code="...Optimizer class..."))
obs = env.step(LandscapeforgeAction(kind="commit"))
print(obs.reward, obs.r_optcoder_breakdown)
HTTP server starts with uv run uvicorn landscapeforge.server.app:app. /reset and /schema work; /step currently returns 500 (see assumption 5).
Known gaps (tracked for next passes)
- SFT warm-start corpus: ~200 hand-authored run_baseline → draft → inspect → draft → commit traces (§15 Stage 0)
- GRPO training script using TRL + HF transformers
- Prompt renderer: format obs into the LLM prompt template from Appendix A
- Curriculum auto-advancement (rolling-mean-regret watchdog on top of env.advance_tier)
- Gradio demo Space with contour + trajectory animation
- Thread-based sandbox timeout to unblock HTTP /step
- True AST-diff-based novelty (replace difflib)
- Docker image + HF Spaces push
Non-goals (v1)
- Free-form LandscapeForge code authoring (deferred to v2 per §18)
- Non-differentiable landscape defense (moot while LandscapeForge is template-picker)
- Multi-turn LandscapeForge-vs-OptCoder within a single episode (sequential only)
- Neural-net-as-landscape Phase-D (v3)