| # Notebook Changes — opengrid_grpo_colab.ipynb |
|
|
| ## Bug fixes applied (2026-04-25) |
|
|
| ### Cell 7 — Generate Training Prompts |
|
|
| | # | Severity | Bug | Fix | |
| |---|----------|-----|-----| |
| 1 | 🔴 Critical | `obs_dict = obs.model_dump()` produces dicts with integer keys; `Dataset.from_dict({"obs_context": obs_contexts})` fails with `ArrowTypeError: Expected dict key of type str or bytes, got 'int'` | Changed to `json.loads(obs.model_dump_json())` so all keys are strings, then stored as `json.dumps(obs_dict)` — a flat JSON string PyArrow handles trivially (sketch below) |
| | 2 | 🟡 Bug | `env = OpenGridEnv(task_config)` instantiated before the loop but immediately replaced inside the loop — wasted object creation | Removed stray instantiation | |
| | 3 | 🟡 Bug | `import copy`, `import json` inside inner loop body — re-imported on every iteration | Moved to top of cell | |
| | 4 | 🟡 Bug | Slack bus included in random action choices — physics solver overwrites it, wasting action budget | Filtered to `['generator', 'battery']` only | |
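
A minimal sketch of fix #1, assuming a pydantic v2 observation model; the `Obs` class and its fields here are stand-ins for the notebook's real observation type:

```python
import json
from pydantic import BaseModel
from datasets import Dataset

class Obs(BaseModel):         # stand-in for the notebook's observation model
    buses: dict[int, float]   # int-keyed, like the real bus/line maps

obs = Obs(buses={0: 1.0, 1: -0.5})

# obs.model_dump() would keep the int keys, which PyArrow rejects.
# model_dump_json() coerces every dict key to a string.
obs_dict = json.loads(obs.model_dump_json())   # {"buses": {"0": 1.0, "1": -0.5}}

# Store each context as one flat JSON string; a string column needs
# no schema inference, so Dataset.from_dict succeeds.
dataset = Dataset.from_dict({"obs_context": [json.dumps(obs_dict)]})
```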
|
|
| ### Cell 8 — Reward Function |
|
|
| | # | Severity | Bug | Fix | |
| |---|----------|-----|-----| |
| 5 | 🔴 Critical | `reward_fn` received `obs_context` as JSON strings from the dataset column but passed them directly to `compute_grpo_reward`, which expects dicts | Added `json.loads(ctx) if isinstance(ctx, str) else ctx` deserialization before scoring (sketch below) |
| | 6 | 🟡 Bug | No assertion to catch silent arity mismatches | Added `assert len(test_rewards) == 2` sanity check | |
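
A sketch of fixes #5 and #6 together; `compute_grpo_reward` is stubbed out here, and the real scorer's signature may differ:

```python
import json

def compute_grpo_reward(completion: str, ctx: dict) -> float:
    return 0.0  # stand-in for the notebook's actual scorer

def reward_fn(completions, obs_context, **kwargs):
    rewards = []
    for completion, ctx in zip(completions, obs_context):
        # Dataset columns arrive as JSON strings; tolerate dicts so the
        # function can also be called directly with parsed contexts.
        ctx_dict = json.loads(ctx) if isinstance(ctx, str) else ctx
        rewards.append(compute_grpo_reward(completion, ctx_dict))
    return rewards

# Sanity check from fix #6: two completions in, two rewards out.
test_rewards = reward_fn(["{}", "{}"], [json.dumps({"freq": 50.0})] * 2)
assert len(test_rewards) == 2
```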
|
|
| ### Cell 9 — Training |
|
|
| | # | Severity | Bug | Fix | |
| |---|----------|-----|-----| |
| 7 | 🟡 Bug | `bf16=torch.cuda.is_bf16_supported()` raises `AssertionError` when CUDA is not available (no GPU runtime) | Guarded: `_cuda_ok = torch.cuda.is_available()`, then `_bf16 = _cuda_ok and ...` (sketch below) |
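
The guard from fix #7, sketched against a TRL-style config; `GRPOConfig` and its arguments are assumptions from context, and any `TrainingArguments`-derived config works the same way:

```python
import torch
from trl import GRPOConfig  # assumed trainer config

# torch.cuda.is_bf16_supported() asserts when no CUDA device is present,
# so short-circuit on availability first.
_cuda_ok = torch.cuda.is_available()
_bf16 = _cuda_ok and torch.cuda.is_bf16_supported()

config = GRPOConfig(
    output_dir="outputs",
    bf16=_bf16,
    fp16=_cuda_ok and not _bf16,   # fall back to fp16 on pre-Ampere GPUs
)
```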
|
|
| ### Cell 12 — Before/After Plot |
|
|
| | # | Severity | Bug | Fix | |
| |---|----------|-----|-----| |
| 8 | 🟡 Bug | Bar labels used `va='bottom'` for all bars; for negative-height bars the label renders inside/below the bar | Fixed: `va='bottom'` when `h >= 0`, `va='top'` when `h < 0`, with matching y-offset (sketch below) |
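
The label-placement logic from fix #8 as a self-contained matplotlib snippet; the bar values are illustrative:

```python
import matplotlib.pyplot as plt

heights = [1.2, -0.8, 0.4]          # illustrative before/after reward deltas
fig, ax = plt.subplots()
bars = ax.bar(range(len(heights)), heights)

for bar in bars:
    h = bar.get_height()
    # Place the label outside the bar: above it for positive heights,
    # below it for negative ones, with a matching y-offset.
    va = "bottom" if h >= 0 else "top"
    offset = 0.02 if h >= 0 else -0.02
    ax.text(bar.get_x() + bar.get_width() / 2, h + offset,
            f"{h:.2f}", ha="center", va=va)
```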
|
|
| ### Cell 13 — Summary Table |
|
|
| | # | Severity | Bug | Fix | |
| |---|----------|-----|-----| |
| 9 | 🟡 Bug | `common_tasks` was set in Cell 12; if the user skips the plot cell, Cell 13 raises a `NameError` | Rebuilt `common_tasks` defensively at the top of Cell 13 (sketch below) |
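
The defensive rebuild from fix #9; the result dicts below are stand-ins, keyed by task name as the notebook's cells assume:

```python
baseline_results = {"task_a": -3.1, "task_b": -1.7}   # stand-in result dicts
trained_results = {"task_a": -0.9, "task_c": -2.2}

# Rebuild at the top of Cell 13 so it no longer depends on the plot
# cell (Cell 12) having been executed first.
common_tasks = sorted(set(baseline_results) & set(trained_results))
```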
|
|
| --- |
|
|
| ## `inference.py` — Code review fixes (2026-04-25) |
|
|
| ### High-priority fixes |
|
|
| | # | Severity | Issue | Fix | |
| |---|----------|-------|-----| |
| 1 | 🔴 Bug | `parse_action()` crashes on valid JSON that is not an object (e.g. `[]`) — the `AttributeError` is not caught by `except (json.JSONDecodeError, KeyError)` | Rewrote with an `isinstance(data, dict)` guard, list unwrapping, field-type validation, and a broad `except Exception` (combined with #2 in the sketch below) |
| | 2 | 🔴 Bug | `parse_action()` markdown/prose stripping is fragile — fails on `Here is the action: {...}` | Extracts first `{...}` substring via `text.find("{")` / `text.rfind("}")` | |
| 3 | 🔴 Reliability | The first `/grader` call can exceed the `httpx` 30 s timeout because `RobustnessGrader` lazily estimates bounds on first use | `grade()` now uses `timeout=180.0`; the base client uses `httpx.Timeout(connect=10, read=60, write=30, pool=10)` |
| 4 | 🟡 Bug | `HF_TOKEN` takes precedence over `OPENAI_API_KEY` — if both are set while targeting an OpenAI endpoint, auth fails | Changed to `API_KEY or OPENAI_API_KEY or HF_TOKEN` priority order |
| | 5 | 🟡 Bug | No JSON-mode enforcement for LLM — models return markdown/prose | Added `response_format={"type": "json_object"}` with fallback for unsupported endpoints | |
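
Fixes #1 and #2 combined into one sketch of `parse_action`; the field-type validation mentioned above is omitted for brevity, and the exact checks in `inference.py` may differ:

```python
import json
from typing import Optional

def parse_action(text: str) -> Optional[dict]:
    """Extract one action dict from raw LLM output; return None if unparseable."""
    try:
        try:
            data = json.loads(text)          # fast path: the whole reply is JSON
        except json.JSONDecodeError:
            # Fix #2: fall back to the first {...} substring, so prose like
            # "Here is the action: {...}" no longer defeats the parser.
            start, end = text.find("{"), text.rfind("}")
            if start == -1 or end <= start:
                return None
            data = json.loads(text[start:end + 1])
        # Unwrap a single-element list such as [{...}].
        if isinstance(data, list) and len(data) == 1 and isinstance(data[0], dict):
            data = data[0]
        # Fix #1: valid JSON that is not an object (e.g. []) used to raise an
        # AttributeError that the narrow except clause never caught.
        if not isinstance(data, dict):
            return None
        return data
    except Exception:  # broad by design: a bad completion must never crash the run
        return None
```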
|
|
| ### System prompt fixes |
|
|
| | # | Severity | Issue | Fix | |
| |---|----------|-------|-----| |
| | 6 | 🟡 Design | Prompt says slack bus is controllable, but physics solver overwrites it | Changed to: "avoid adjusting the slack bus — physics overwrites it" | |
| | 7 | 🟡 Design | Single-agent mode allows topology actions without safety layer protection | Added: "Prefer NO topology actions unless absolutely necessary" | |
| | 8 | 🟡 Design | Multi-agent prompt says "Only for lines in your zone" but observations include boundary lines | Clarified: "Only for visible internal or boundary lines. Boundary-line switching is risky" | |
|
|
| ### Multi-agent robustness fixes |
|
|
| | # | Severity | Issue | Fix | |
| |---|----------|-------|-----| |
| 9 | 🟡 Bug | Agent iteration uses `range(num_agents)` — assumes contiguous integer IDs | Changed to `sorted(observations.keys())` (combined sketch after this table) |
| | 10 | 🟡 Bug | `safety_reports` assumed to be list, but API returns dict keyed by agent ID | Added `isinstance` check to handle both list and dict formats | |
| | 11 | 🟡 Design | Safety correction feedback not fed back to LLM — model repeats same invalid actions | Appended `[SAFETY] {reason}` to agent history when corrections occur | |
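
A sketch combining fixes #9–11; the report fields (`corrected`, `reason`) and the history format are assumptions about the API's shape:

```python
def apply_safety_feedback(observations: dict, safety_reports, histories: dict) -> None:
    # Fix #9: iterate the actual agent IDs; they need not be 0..n-1.
    for agent_id in sorted(observations.keys()):
        # Fix #10: the API may return a dict keyed by agent ID or a
        # positional list; normalize before lookup.
        if isinstance(safety_reports, dict):
            report = safety_reports.get(agent_id)
        elif isinstance(agent_id, int) and agent_id < len(safety_reports):
            report = safety_reports[agent_id]
        else:
            report = None
        # Fix #11: surface corrections to the LLM so it stops repeating
        # the same invalid actions on later steps.
        if report and report.get("corrected"):
            histories[agent_id].append(f"[SAFETY] {report['reason']}")
```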
|
|
| ### Other fixes |
|
|
| | # | Severity | Issue | Fix | |
| |---|----------|-------|-----| |
| | 12 | 🟡 Bug | `MAX_STEPS = 50` hardcoded — may truncate future tasks | Changed to `MAX_STEPS = 100` as safety cap; `done` flag is the true terminator | |
| | 13 | 🟡 Bug | Default task list excludes `task_karnataka` despite KPTCL multi-agent framing | Added `task_karnataka` to `TASKS` list | |
| | 14 | 🟡 Bug | Module docstring says all 3 env vars are required; only API key is | Fixed docstring to document defaults and actual requirements | |
| | 15 | 🟡 Bug | `[END]` log prints score at `.2f` but summary prints `.4f` — precision loss | Changed `log_end` to use `:.4f` | |
| 16 | 🟡 Reliability | `OpenAI()` client has no timeout or retry config | Added `timeout=30.0, max_retries=2` (sketch below) |
| | 17 | 🟢 Feature | No `list_tasks()` method on `EnvClient` | Added `list_tasks()` for future task validation | |
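
The client-configuration fixes (#3 and #4 from the high-priority table, plus #16) condensed into one sketch; the env-var names follow the tables above:

```python
import os
import httpx
from openai import OpenAI

# Fix #4: explicit key precedence, so an OpenAI endpoint is never
# silently authenticated with a Hugging Face token.
api_key = os.getenv("API_KEY") or os.getenv("OPENAI_API_KEY") or os.getenv("HF_TOKEN")

# Fix #16: bounded waits and automatic retries on the LLM client.
llm = OpenAI(api_key=api_key, timeout=30.0, max_retries=2)

# Fix #3: per-phase timeouts for the environment/grader HTTP client.
env_http = httpx.Client(timeout=httpx.Timeout(connect=10, read=60, write=30, pool=10))
# The slow /grader call overrides its budget per request:
# env_http.post("/grader", json=payload, timeout=180.0)
```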
|
|
| --- |
|
|
| ## GRPO Training — Environment-Grounded Rewards (2026-04-25) |
|
|
| ### Root Cause: Proxy Reward Disconnect |
|
|
The original `compute_grpo_reward` was a **heuristic proxy scorer**: it evaluated JSON format, action direction, and proportionality without ever stepping the environment. The model optimized this proxy, which did not correlate with the actual grid-physics reward, so training produced zero improvement over the baseline.
|
|
| ### Changes Made |
|
|
| #### `src/environment.py` |
|
|
| | # | Change | Purpose | |
| |---|--------|---------| |
| 1 | Added `_set_state(obs_dict)` method to `OpenGridEnv` | Enables restoring the environment to any observed state for reward computation. Rebuilds bus/line state, frequency, and slack injection from observation dicts (sketch below). |
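
A heavily simplified sketch of what `_set_state` does; the dataclasses and field names here are stand-ins, not the real `OpenGridEnv` internals:

```python
from dataclasses import dataclass, field

@dataclass
class _Bus:                    # stand-in for the env's bus object
    injection_mw: float = 0.0

@dataclass
class _Line:                   # stand-in for the env's line object
    in_service: bool = True

@dataclass
class MiniGridEnv:
    buses: dict = field(default_factory=dict)
    lines: dict = field(default_factory=dict)
    frequency: float = 50.0
    slack_injection_mw: float = 0.0

    def _set_state(self, obs_dict: dict) -> None:
        """Restore the env to a previously observed state."""
        # Keys arrive as strings after the JSON round-trip; cast back to int.
        for bus_id, bus in obs_dict["buses"].items():
            self.buses.setdefault(int(bus_id), _Bus()).injection_mw = bus["injection_mw"]
        for line_id, line in obs_dict["lines"].items():
            self.lines.setdefault(int(line_id), _Line()).in_service = line["in_service"]
        self.frequency = obs_dict["frequency"]
        self.slack_injection_mw = obs_dict["slack_injection_mw"]
```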
|
|
#### `training/train_grpo.py`

| | # | Severity | Change | Details | |
| |---|----------|--------|---------| |
| 2 | 🔴 Critical | Replaced `compute_grpo_reward` with `compute_grpo_reward_env` | New reward function **actually steps the physics simulation**: restores env state → steps with the LLM action → measures the real reward → runs a mini-rollout with heuristic continuation for trajectory awareness (sketch after this table) |
| | 3 | 🔴 Critical | Added mini-rollout scoring (horizon=3) | After the LLM's action, runs 2 more steps with heuristic policy to capture trajectory-level impact. Combines: `immediate_reward + 0.5 * rollout_reward` | |
| | 4 | 🟡 Medium | Increased `num_generations` from 4 → 8 | Wider GRPO group = more reward variance = stronger ranking signal. Prevents the advantage calculation from collapsing to zero. | |
| | 5 | 🟡 Medium | Increased random perturbation range from ±15 → ±30 MW | Creates more diverse/stressed grid states during training data generation. Model sees near-blackout and overload scenarios. | |
| | 6 | 🟡 Medium | Added adversarial battery drain (every 5th episode) | Forces model to learn actions when batteries are near-empty — a critical edge case the original data lacked. | |
| | 7 | 🟡 Medium | Multi-bus perturbations (1-2 buses per step) | Was single-bus. More diverse action patterns create richer state transitions. | |
| | 8 | 🟡 Medium | Increased learning rate from 5e-6 → 1e-5 | Slightly more aggressive to capitalize on the now-meaningful reward signal. | |
| | 9 | 🟡 Medium | Increased gradient accumulation (effective batch 16) | Smoother gradients for more stable training. | |
| | 10 | 🟡 Medium | Steps per episode increased from 10 → 15 | More temporal diversity in observations. | |
| | 11 | 🟢 Minor | obs_context stored as JSON string | Fixes Arrow serialization (PyArrow can't handle dicts with int keys). | |
| | 12 | 🟢 Minor | Kept legacy `compute_grpo_reward` for test-mode compat | Backward compatibility with `--test-mode` pipeline verification. | |
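
Changes #2 and #3 as one sketch; `env.step`'s gym-style return convention and the `heuristic_action` helper are assumptions standing in for the real trainer code:

```python
def heuristic_action(obs) -> dict:
    return {}  # stand-in for the trainer's heuristic continuation policy

def compute_grpo_reward_env(env, obs_dict: dict, action: dict,
                            horizon: int = 3) -> float:
    """Score an LLM action by stepping the actual physics simulation."""
    # Restore the env to the exact state the prompt was generated from.
    env._set_state(obs_dict)

    # Step with the LLM's action and take the real physics reward.
    obs, immediate_reward, done, _ = env.step(action)

    # Mini-rollout: continue with the heuristic policy for horizon - 1
    # steps so the score reflects trajectory-level impact, not one step.
    rollout_reward = 0.0
    for _ in range(horizon - 1):
        if done:
            break
        obs, r, done, _ = env.step(heuristic_action(obs))
        rollout_reward += r

    # Combination rule from change #3.
    return immediate_reward + 0.5 * rollout_reward
```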