
# Notebook Changes — opengrid_grpo_colab.ipynb

## Bug fixes applied (2026-04-25)

### Cell 7 — Generate Training Prompts

| # | Severity | Bug | Fix |
|---|----------|-----|-----|
| 1 | 🔴 Critical | `obs_dict = obs.model_dump()` produces dicts with integer keys; `Dataset.from_dict({"obs_context": obs_contexts})` fails with `ArrowTypeError: Expected dict key of type str or bytes, got 'int'` | Changed to `json.loads(obs.model_dump_json())` so all keys are strings; then stored as `json.dumps(obs_dict)`, a flat JSON string PyArrow handles trivially (see the sketch after this table) |
| 2 | 🟡 Bug | `env = OpenGridEnv(task_config)` instantiated before the loop but immediately replaced inside the loop; wasted object creation | Removed the stray instantiation |
| 3 | 🟡 Bug | `import copy`, `import json` inside the inner loop body; re-imported on every iteration | Moved to the top of the cell |
| 4 | 🟡 Bug | Slack bus included in random action choices; the physics solver overwrites it, wasting action budget | Filtered to `['generator', 'battery']` only |
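
A minimal, self-contained sketch of fix #1, assuming a Pydantic v2 model with `model_dump_json()`; the `Obs` model and its field are illustrative, not the notebook's actual schema:

```python
import json

from datasets import Dataset
from pydantic import BaseModel


class Obs(BaseModel):
    bus_voltages: dict[int, float]  # int keys survive model_dump()


obs = Obs(bus_voltages={0: 1.01, 1: 0.98})

# model_dump() would keep the int keys and trip PyArrow; round-tripping
# through JSON stringifies every key.
obs_dict = json.loads(obs.model_dump_json())

# Store each observation as one flat JSON string, which PyArrow handles trivially.
dataset = Dataset.from_dict({"obs_context": [json.dumps(obs_dict)]})
```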

### Cell 8 — Reward Function

| # | Severity | Bug | Fix |
|---|----------|-----|-----|
| 5 | 🔴 Critical | `reward_fn` received `obs_context` as JSON strings from the dataset column but passed them directly to `compute_grpo_reward`, which expects dicts | Added `json.loads(ctx) if isinstance(ctx, str) else ctx` deserialization before scoring (see the sketch after this table) |
| 6 | 🟡 Bug | No assertion to catch silent arity mismatches | Added an `assert len(test_rewards) == 2` sanity check |
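
A sketch of fixes #5 and #6, assuming a TRL-style reward hook where dataset columns arrive as keyword arguments; the `compute_grpo_reward` stub stands in for the real scorer:

```python
import json


def compute_grpo_reward(completion: str, ctx: dict) -> float:
    return 0.0  # stand-in for the real scorer


def reward_fn(completions, obs_context, **kwargs):
    rewards = []
    for completion, ctx in zip(completions, obs_context):
        # Fix #5: the dataset column holds JSON strings; the scorer expects dicts.
        ctx = json.loads(ctx) if isinstance(ctx, str) else ctx
        rewards.append(compute_grpo_reward(completion, ctx))
    return rewards


# Fix #6: cheap sanity check that rewards come back one-per-completion.
test_rewards = reward_fn(["{}", "{}"], [json.dumps({"frequency": 50.0})] * 2)
assert len(test_rewards) == 2
```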

### Cell 9 — Training

| # | Severity | Bug | Fix |
|---|----------|-----|-----|
| 7 | 🟡 Bug | `bf16=torch.cuda.is_bf16_supported()` raises `AssertionError` when CUDA is not available (no GPU runtime) | Guarded: `_cuda_ok = torch.cuda.is_available()`, then `_bf16 = _cuda_ok and ...` (see the sketch after this table) |
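
The guard from fix #7, roughly as it would appear at the top of the training cell (the `_fp16` fallback is an assumption about how the flags are consumed):

```python
import torch

# torch.cuda.is_bf16_supported() asserts on CPU-only runtimes, so gate
# every CUDA capability query behind is_available().
_cuda_ok = torch.cuda.is_available()
_bf16 = _cuda_ok and torch.cuda.is_bf16_supported()
_fp16 = _cuda_ok and not _bf16  # assumed fallback when bf16 is unavailable
```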

### Cell 12 — Before/After Plot

| # | Severity | Bug | Fix |
|---|----------|-----|-----|
| 8 | 🟡 Bug | Bar labels used `va='bottom'` for all bars; for negative-height bars the label renders inside/below the bar | Fixed: `va='bottom'` when `h >= 0`, `va='top'` when `h < 0`, with a matching y-offset (see the sketch after this table) |
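
A self-contained sketch of fix #8; the heights are illustrative:

```python
import matplotlib.pyplot as plt

heights = [0.42, -0.17, 0.08]  # illustrative before/after reward deltas
fig, ax = plt.subplots()
bars = ax.bar(range(len(heights)), heights)

for bar in bars:
    h = bar.get_height()
    # Anchor the label just outside the bar, on the correct side of zero.
    va = "bottom" if h >= 0 else "top"
    offset = 0.01 if h >= 0 else -0.01
    ax.text(bar.get_x() + bar.get_width() / 2, h + offset,
            f"{h:.2f}", ha="center", va=va)
```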

### Cell 13 — Summary Table

| # | Severity | Bug | Fix |
|---|----------|-----|-----|
| 9 | 🟡 Bug | `common_tasks` was set in Cell 12; if the user skips the plot cell, Cell 13 raises `NameError: common_tasks` | Rebuilt `common_tasks` defensively at the top of Cell 13 (see the sketch after this table) |
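
A sketch of the defensive rebuild in fix #9; `baseline_results` and `trained_results` are hypothetical names for the per-task score dicts:

```python
# Do not depend on Cell 12 having run; recompute the shared task set here.
baseline_results = {"task_basic": -1.2, "task_karnataka": -3.4}  # hypothetical
trained_results = {"task_basic": -0.8, "task_karnataka": -2.1}   # hypothetical

common_tasks = sorted(set(baseline_results) & set(trained_results))
for task in common_tasks:
    print(f"{task}: {baseline_results[task]:.4f} -> {trained_results[task]:.4f}")
```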

# inference.py — Code review fixes (2026-04-25)

## High-priority fixes

| # | Severity | Issue | Fix |
|---|----------|-------|-----|
| 1 | 🔴 Bug | `parse_action()` crashes on valid JSON that is not an object (e.g. `[]`); the resulting `AttributeError` is not caught by `except (json.JSONDecodeError, KeyError)` | Rewrote with an `isinstance(data, dict)` guard, list unwrapping, field-type validation, and a broad `except Exception` (see the sketch after this table) |
| 2 | 🔴 Bug | `parse_action()` markdown/prose stripping is fragile; it fails on `Here is the action: {...}` | Extracts the first `{...}` substring via `text.find("{")` / `text.rfind("}")` |
| 3 | 🔴 Reliability | The `/grader` call can exceed the 30 s httpx timeout on first use (lazy `RobustnessGrader` bound estimation) | `grade()` now uses `timeout=180.0`; the base client uses `httpx.Timeout(connect=10, read=60, write=30, pool=10)` |
| 4 | 🟡 Bug | `HF_TOKEN` takes precedence over `OPENAI_API_KEY`; if both are set with an OpenAI endpoint, auth fails | Changed priority order to `API_KEY` → `OPENAI_API_KEY` → `HF_TOKEN` |
| 5 | 🟡 Bug | No JSON-mode enforcement for the LLM; models return markdown/prose | Added `response_format={"type": "json_object"}` with a fallback for unsupported endpoints |
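
A sketch combining fixes #1-#3; the validation in the real `parse_action()` is richer, and the parse-then-fall-back order shown here is an assumption:

```python
import json
from typing import Optional

import httpx


def parse_action(text: str) -> Optional[dict]:
    try:
        data = json.loads(text)
    except Exception:
        # Fix #2: fall back to the first {...} substring, so prose like
        # "Here is the action: {...}" still parses.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            return None
        try:
            data = json.loads(text[start:end + 1])
        except Exception:  # fix #1: broad on purpose; any bad payload -> None
            return None
    if isinstance(data, list) and data and isinstance(data[0], dict):
        data = data[0]  # unwrap [{...}]
    # Fix #1: valid JSON that is not an object (e.g. []) used to crash.
    return data if isinstance(data, dict) else None


# Fix #3: per-phase timeouts for the base client; the slow first /grader
# call gets its own long window at the call site (timeout=180.0).
client = httpx.Client(timeout=httpx.Timeout(connect=10, read=60, write=30, pool=10))
```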

## System prompt fixes

| # | Severity | Issue | Fix |
|---|----------|-------|-----|
| 6 | 🟡 Design | Prompt says the slack bus is controllable, but the physics solver overwrites it | Changed to: "avoid adjusting the slack bus — physics overwrites it" |
| 7 | 🟡 Design | Single-agent mode allows topology actions without safety-layer protection | Added: "Prefer NO topology actions unless absolutely necessary" |
| 8 | 🟡 Design | Multi-agent prompt says "Only for lines in your zone" but observations include boundary lines | Clarified: "Only for visible internal or boundary lines. Boundary-line switching is risky" |

## Multi-agent robustness fixes

| # | Severity | Issue | Fix |
|---|----------|-------|-----|
| 9 | 🟡 Bug | Agent iteration uses `range(num_agents)`, which assumes contiguous integer IDs | Changed to `sorted(observations.keys())` |
| 10 | 🟡 Bug | `safety_reports` assumed to be a list, but the API returns a dict keyed by agent ID | Added an `isinstance` check to handle both list and dict formats |
| 11 | 🟡 Design | Safety-correction feedback is not fed back to the LLM, so the model repeats the same invalid actions | Appended `[SAFETY] {reason}` to the agent history when corrections occur (see the sketch after this table) |
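
A sketch of fixes #9-#11 together; the observation and report shapes are illustrative:

```python
# Agent IDs need not be contiguous integers (fix #9).
observations = {"agent_3": {"frequency": 49.8}, "agent_1": {"frequency": 50.1}}
safety_reports = {"agent_3": "clipped battery setpoint"}  # may also be a list


def normalize_safety_reports(reports, agent_ids):
    """Fix #10: accept a list (by position) or a dict (keyed by agent ID)."""
    return reports if isinstance(reports, dict) else dict(zip(agent_ids, reports))


agent_ids = sorted(observations.keys())  # fix #9
reports = normalize_safety_reports(safety_reports, agent_ids)

history: dict[str, list[str]] = {aid: [] for aid in agent_ids}
for agent_id in agent_ids:
    reason = reports.get(agent_id)
    if reason:
        # Fix #11: surface corrections so the model stops repeating them.
        history[agent_id].append(f"[SAFETY] {reason}")
```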

## Other fixes

| # | Severity | Issue | Fix |
|---|----------|-------|-----|
| 12 | 🟡 Bug | `MAX_STEPS = 50` hardcoded; may truncate future tasks | Changed to `MAX_STEPS = 100` as a safety cap; the `done` flag is the true terminator |
| 13 | 🟡 Bug | Default task list excludes `task_karnataka` despite the KPTCL multi-agent framing | Added `task_karnataka` to the `TASKS` list |
| 14 | 🟡 Bug | Module docstring says all 3 env vars are required; only the API key is | Fixed the docstring to document defaults and actual requirements |
| 15 | 🟡 Bug | The `[END]` log prints the score at `.2f` but the summary prints `.4f`; precision loss | Changed `log_end` to use `:.4f` |
| 16 | 🟡 Reliability | The `OpenAI()` client has no timeout or retry config | Added `timeout=30.0, max_retries=2` (see the sketch after this table) |
| 17 | 🟢 Feature | No `list_tasks()` method on `EnvClient` | Added `list_tasks()` for future task validation |
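
A sketch of fixes #4, #5, and #16 around the OpenAI client; the model name is illustrative:

```python
import os

from openai import OpenAI

# Fix #4: explicit API_KEY wins, then OPENAI_API_KEY, then HF_TOKEN.
api_key = os.getenv("API_KEY") or os.getenv("OPENAI_API_KEY") or os.getenv("HF_TOKEN")

# Fix #16: never leave the client unbounded.
client = OpenAI(api_key=api_key, timeout=30.0, max_retries=2)

messages = [{"role": "user", "content": "Return the next grid action as JSON."}]
try:
    # Fix #5: ask for strict JSON when the endpoint supports it.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=messages,
        response_format={"type": "json_object"},
    )
except Exception:
    # Fallback for endpoints that reject response_format.
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
```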

# GRPO Training — Environment-Grounded Rewards (2026-04-25)

## Root Cause: Proxy Reward Disconnect

The original `compute_grpo_reward` was a heuristic proxy scorer that evaluated JSON format, direction, and proportionality without ever stepping the environment. The model optimized this proxy, which did not correlate with actual grid-physics rewards. Result: zero improvement over baseline.

## Changes Made

### src/environment.py

| # | Change | Purpose |
|---|--------|---------|
| 1 | Added a `_set_state(obs_dict)` method to `OpenGridEnv` | Enables restoring the environment to any observed state for reward computation. Rebuilds bus/line state, frequency, and slack injection from observation dicts (see the sketch after this table). |
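
A heavily simplified sketch of what `_set_state` might look like; the field names (`bus_injections`, `line_status`, etc.) and the `_Bus`/`_Line` skeleton are assumptions, since the real observation schema is not shown here:

```python
from dataclasses import dataclass


@dataclass
class _Bus:
    p_mw: float = 0.0


@dataclass
class _Line:
    closed: bool = True


class OpenGridEnv:
    def __init__(self, n_bus: int = 3, n_line: int = 2):
        self.buses = {i: _Bus() for i in range(n_bus)}
        self.lines = {i: _Line() for i in range(n_line)}
        self.frequency = 50.0
        self.slack_injection = 0.0

    def _set_state(self, obs_dict: dict) -> None:
        # JSON round-tripping stringifies dict keys, so cast back to int.
        for bus_id, p_mw in obs_dict.get("bus_injections", {}).items():
            self.buses[int(bus_id)].p_mw = float(p_mw)
        for line_id, closed in obs_dict.get("line_status", {}).items():
            self.lines[int(line_id)].closed = bool(closed)
        self.frequency = float(obs_dict.get("frequency", 50.0))
        self.slack_injection = float(obs_dict.get("slack_injection", 0.0))
```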

### training/train_grpo.py

| # | Severity | Change | Details |
|---|----------|--------|---------|
| 2 | 🔴 Critical | Replaced `compute_grpo_reward` with `compute_grpo_reward_env` | The new reward function actually steps the physics simulation: restores env state → steps with the LLM action → measures the real reward → runs a mini-rollout with heuristic continuation for trajectory awareness (see the sketch after this table) |
| 3 | 🔴 Critical | Added mini-rollout scoring (`horizon=3`) | After the LLM's action, runs 2 more steps with the heuristic policy to capture trajectory-level impact. Combines: `immediate_reward + 0.5 * rollout_reward` |
| 4 | 🟡 Medium | Increased `num_generations` from 4 → 8 | A wider GRPO group means more reward variance and a stronger ranking signal; prevents the advantage calculation from collapsing to zero |
| 5 | 🟡 Medium | Increased the random perturbation range from ±15 → ±30 MW | Creates more diverse/stressed grid states during training-data generation; the model sees near-blackout and overload scenarios |
| 6 | 🟡 Medium | Added adversarial battery drain (every 5th episode) | Forces the model to learn actions when batteries are near-empty, a critical edge case the original data lacked |
| 7 | 🟡 Medium | Multi-bus perturbations (1-2 buses per step) | Was single-bus; more diverse action patterns create richer state transitions |
| 8 | 🟡 Medium | Increased the learning rate from 5e-6 → 1e-5 | Slightly more aggressive, to capitalize on the now-meaningful reward signal |
| 9 | 🟡 Medium | Increased gradient accumulation (effective batch 16) | Smoother gradients for more stable training |
| 10 | 🟡 Medium | Steps per episode increased from 10 → 15 | More temporal diversity in observations |
| 11 | 🟢 Minor | `obs_context` stored as a JSON string | Fixes Arrow serialization (PyArrow can't handle dicts with int keys) |
| 12 | 🟢 Minor | Kept the legacy `compute_grpo_reward` for test-mode compatibility | Backward compatibility with the `--test-mode` pipeline verification |
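
A sketch of the environment-grounded reward from changes #2-#3, assuming a Gym-style `env.step()` that returns `(obs, reward, done)`; the real function's signature may differ:

```python
def compute_grpo_reward_env(env, obs_dict: dict, llm_action: dict,
                            heuristic_policy, horizon: int = 3) -> float:
    # Restore the environment to the observed state (change #1).
    env._set_state(obs_dict)

    # Step once with the LLM's action and take the real physics reward.
    obs, immediate_reward, done = env.step(llm_action)

    # Mini-rollout (change #3): continue with the heuristic policy so the
    # score reflects trajectory-level impact, not just one step.
    rollout_reward = 0.0
    for _ in range(horizon - 1):  # horizon=3 -> two continuation steps
        if done:
            break
        obs, r, done = env.step(heuristic_policy(obs))
        rollout_reward += r

    return immediate_reward + 0.5 * rollout_reward
```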