Update docs/hf-mini-blog-ev-grid-oracle.md

<!-- ============================================================
     EV GRID ORACLE · HF Mini Blog
     Team Codestreak · Nitish R.G., Padmanabhan, Prithic
     ============================================================ -->

<div align="center">

```
███████╗██╗   ██╗     ██████╗ ██████╗ ██╗██████╗
██╔════╝██║   ██║    ██╔════╝ ██╔══██╗██║██╔══██╗
█████╗  ██║   ██║    ██║  ███╗██████╔╝██║██║  ██║
██╔══╝  ╚██╗ ██╔╝    ██║   ██║██╔══██╗██║██║  ██║
███████╗ ╚████╔╝     ╚██████╔╝██║  ██║██║██████╔╝
╚══════╝  ╚═══╝       ╚═════╝ ╚═╝  ╚═╝╚═╝╚═════╝

 ██████╗ ██████╗  █████╗  ██████╗██╗     ███████╗
██╔═══██╗██╔══██╗██╔══██╗██╔════╝██║     ██╔════╝
██║   ██║██████╔╝███████║██║     ██║     █████╗
██║   ██║██╔══██╗██╔══██║██║     ██║     ██╔══╝
╚██████╔╝██║  ██║██║  ██║╚██████╗███████╗███████╗
 ╚═════╝ ╚═╝  ╚═╝╚═╝  ╚═╝ ╚═════╝╚══════╝╚══════╝
```

### ⚡ *A Verifiable GRPO Dispatch Oracle for Bangalore's EV Charging Grid* ⚡

<br/>

[OpenEnv 0.2.3](https://pypi.org/project/openenv-core/)
[Live Space](https://huggingface.co/spaces/NITISHRG15102007/ev-grid-oracle)
[Open in Colab](https://colab.research.google.com/github/NITISH-R-G/ev-grid-oracle/blob/main/training/train_grpo.ipynb)
[GitHub](https://github.com/NITISH-R-G/ev-grid-oracle)
[License](LICENSE)

<br/>

> **"Don't explain. Don't hallucinate. Execute — and prove it."**

<br/>

---

</div>

## 🌆 The City That Never Stops Charging

Picture Bangalore at 6:47 PM on a Tuesday.

The IT corridors of Whitefield are emptying. Thousands of EVs — cabs, bikes, BMTC feeders — converge on 25 charging stations simultaneously. Feeder substations creak toward their thermal limits. Clean solar energy is being wasted because nobody knows where to send it. Critical vehicles with 4% battery are sitting in queues behind someone's leisurely overnight top-up.

**The grid operator has maybe 90 seconds to reroute, defer, and load-shift before the cascade starts.**

Today, that decision is made by gut feel, spreadsheets, or rule-of-thumb heuristics. **EV Grid Oracle changes that.** We built a reinforcement-learning agent that takes the real-time grid snapshot, outputs structured executable actions, and improves *by actually running in a verified simulator* — not by reading papers about it.

This is not a chatbot that talks about EV routing. This is an agent that *does it*, *proves it*, and *gets better*.

---

<div align="center">

## 👥 Team Codestreak

</div>

|   | Builder | Role |
|---|---|---|
| ⚡ | **Nitish R.G.** — [LinkedIn](https://www.linkedin.com/in/nitish-r-g-15-10-2007-rgn/) | Team Leader · Architecture · GRPO training |
| 🔌 | **Padmanabhan Suresh Babu** | Environment design · Reward engineering |
| 🗺️ | **Prithic** | Road-graph routing · Evaluation harness |

---

<div align="center">

## 🔥 The 3-Second Hook

### *What makes this different from every other "AI + EV" project?*

</div>

```
┌─────────────────────────────────────────────────────────────────────┐
│                                                                     │
│  Most projects:   Prompt → LLM → Text → 👍 vibes                    │
│                                                                     │
│  EV Grid Oracle:  Prompt → LLM → Action → Parse → Validate →        │
│                   Simulate → Reward Breakdown → GRPO Update → 🔄    │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

| Dimension | Typical "AI + EV" Demo | **EV Grid Oracle** |
|---|---|---|
| Output type | Natural language explanation | Structured, executable action schema |
| Verification | LLM-as-judge / vibes | Deterministic simulator + reward verifier |
| Training signal | None or SFT | GRPO with multi-component verifiable rewards |
| Anti-cheat | None | Hard constraint flags (teleport detection, etc.) |
| Reproducibility | "Trust us" | Paired seeds, committed artifacts, auditable JSON |
| Evidence | Screenshots | PNG plots + JSON stats + TensorBoard logs |

**Three things no other team can fake:**

1. 🧪 **Verifiable world** — every action is parsed → validated → stepped in a simulator → scored with a full reward breakdown
2. 🔁 **Replayable evidence** — baseline vs oracle evaluated on the *same deterministic seeds*
3. 📦 **Engineer-grade artifacts** — plots, stats, and logs committed to the repo so judges can audit without rerunning anything

---

## 🖥️ Live Command Center

> *This is what a judge sees the moment they open the Space.*



The Command Center gives you:

- **Real-time grid snapshot** — queue lengths, feeder stress levels, renewable availability
- **Agent action trace** — what the model decided and why, in structured format
- **Reward component breakdown** — every scalar contribution, live
- **Episode replay controls** — step forward/back through any decision point

---

<div align="center">

## 🏙️ The Problem: Bangalore's Grid Under Siege

</div>

Bangalore's charging infrastructure is growing faster than the grid that feeds it. Here's what dispatchers face *every peak window*:

```
┌──────────────────────────────────────────────────────────────┐
│ 6:00 PM  ████████████████████░░░░  QUEUE: 847 EVs waiting    │
│ 6:15 PM  ████████████████████████  STRESS: BLR-07 at 94%     │
│ 6:30 PM  ████████████████████████  PEAK VIOLATION: -₹12k     │
│ 6:45 PM  ██████████████░░░░░░░░░░  SOLAR WINDOW: wasted      │
│ 7:00 PM  ████████████████████████  CRITICAL EV: 3% battery   │
└──────────────────────────────────────────────────────────────┘
```

The constraints are real and they **stack**:

- 🚗 **Queue pressure** — every minute of wait degrades user trust and SLA scores
- ⚡ **Feeder thermal limits** — exceeding them risks hardware damage and grid events
- 🌿 **Renewable windows** — solar/wind availability is time-varying; miss it and you pay the carbon cost
- 🚨 **Critical EVs** — deferring a 3%-battery vehicle is a safety failure, not a scheduling choice
- 🔒 **No cheating** — you cannot teleport vehicles, you cannot route to non-adjacent nodes, physics applies

A human dispatcher watching five dashboards makes 200+ micro-decisions per hour. **We trained an LLM to make them faster, verifiably.**

---

<div align="center">

## 🏗️ Architecture: The Full Loop

*From prompt to policy update — nothing hidden.*

</div>

```mermaid
flowchart LR
    subgraph HF["🤗 Hugging Face Space: EV Grid Oracle"]
        API["FastAPI server\nserver/app.py"]
        ENV["EVGridEnvironment\nOpenEnv API"]
        CORE["Simulator core\nEVGridCore / RoadCore"]
        REW["Verifier + Reward\nbreakdown + anti-hack"]
        API --> ENV --> CORE --> REW
    end

    subgraph TRAIN["⚙️ Colab Training"]
        NB["training/train_grpo.ipynb"]
        TRL["TRL GRPOTrainer"]
        UNS["Unsloth runtime\nQLoRA adapters"]
        TB["TensorBoard\nev_oracle_grpo_road"]
        NB --> TRL --> UNS --> TB
    end

    MODEL["🧠 Qwen2.5-3B + LoRA\nSmall LLM policy"]
    USER["👤 Judge / Operator"]

    USER -->|"POST /reset, /step"| API
    REW -->|"reward + next obs"| USER
    TRL -->|"sample completions"| MODEL
    MODEL -->|"action text"| TRL
    TRL -->|"reward_funcs → RoadCore.step"| CORE
    CORE -->|"reward breakdown"| TRL
    TRL -->|"update adapters"| MODEL
```

### How the data flows

```
[Grid Snapshot]
      │
      ▼
┌─────────────┐      ┌──────────────┐      ┌───────────────┐
│  LLM Policy │─────▶│ Action Parser│─────▶│  Constraint   │
│  Qwen2.5-3B │      │ (strict regex│      │  Validator    │
│  + LoRA     │      │  + schema)   │      │ (no teleport, │
└─────────────┘      └──────────────┘      │  valid edges) │
                                           └───────┬───────┘
                                                   │
                                                   ▼
                                          ┌────────────────┐
                                          │   EVGridCore   │
                                          │   Simulator    │
                                          │  (tick advance)│
                                          └───────┬────────┘
                                                  │
                       ┌──────────────────────────┼──────────────────────────┐
                       ▼                          ▼                          ▼
                 [wait reward]          [grid_stress reward]         [renewable reward]
                       │                          │                          │
                       └──────────────────────────┴──────────────────────────┘
                                                  │
                                                  ▼
                                       [GRPO policy update]
```

---

<div align="center">

## 🎯 What the Agent Can Do

### *Two action schemas. Both verifiable. No hallucination tolerated.*

</div>

### Schema A — Station Routing & Load Shifting

> *"Which station? At what rate? Defer by how long?"*

```
╔══════════════════════════════════════════════════════════════╗
║                     EVGridAction Schema                      ║
╠══════════════════════════════════════════════════════════════╣
║  ACTION:        route | defer | load_shift                   ║
║  STATION:       BLR-01 … BLR-25 (or NONE)                    ║
║  CHARGE_RATE:   slow | fast | ultra_fast                     ║
║  DEFER_MINUTES: integer (0 = don't defer)                    ║
║  REASON:        ≤ 20 words                                   ║
║  CONFIDENCE:    0.0 – 1.0                                    ║
╚══════════════════════════════════════════════════════════════╝
```

**Example valid action:**
```
ACTION: load_shift
STATION: BLR-07
CHARGE_RATE: slow
DEFER_MINUTES: 15
REASON: Feeder stress at 92%; shift to renewable window at T+15.
CONFIDENCE: 0.87
```

**Example rejected action (anti-cheat catches this):**
```
ACTION: route
STATION: BLR-99          ← ❌ Station doesn't exist
CHARGE_RATE: warp_speed  ← ❌ Invalid rate enum
CONFIDENCE: 1.5          ← ❌ Out of range
→ anti_cheat_flag=True, reward penalty applied
```
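
The parse → validate step above is deliberately boring code. A minimal sketch of how such a validator can work, assuming only the field names and enums shown in the schema (the repo's actual parser may differ in detail):

```python
import re

# Hypothetical sketch of the strict parse → validate step described above;
# field names mirror the schema, but the repo's real parser may differ.
VALID_STATIONS = {f"BLR-{i:02d}" for i in range(1, 26)} | {"NONE"}
VALID_RATES = {"slow", "fast", "ultra_fast"}
FIELD_RE = re.compile(r"^([A-Z_]+):\s*(.+)$")

def parse_action(text: str) -> tuple[dict, list[str]]:
    """Return (fields, violations). Any violation raises an anti-cheat/format flag."""
    fields, violations = {}, []
    for line in text.strip().splitlines():
        m = FIELD_RE.match(line.strip())
        if m:
            fields[m.group(1)] = m.group(2).strip()
    if fields.get("ACTION") not in {"route", "defer", "load_shift"}:
        violations.append("invalid_action")
    if fields.get("STATION") not in VALID_STATIONS:
        violations.append("invalid_station")
    if fields.get("CHARGE_RATE") not in VALID_RATES:
        violations.append("rate_enum_invalid")
    try:
        conf = float(fields.get("CONFIDENCE", "nan"))
        if not 0.0 <= conf <= 1.0:
            violations.append("confidence_oob")
    except ValueError:
        violations.append("confidence_oob")
    return fields, violations
```

Run against the two examples above, the valid action yields an empty violation list, while the rejected one trips `invalid_station`, `rate_enum_invalid`, and `confidence_oob` exactly as annotated.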

---

### Schema B — Road-Graph Routing (Bangalore OSM)

> *"Where do I send this vehicle next? And I can't teleport it."*

```
╔══════════════════════════════════════════════════════════════╗
║                      RoadAction Schema                       ║
╠══════════════════════════════════════════════════════════════╣
║  CURRENT_NODE: <int>  (real OSM node in BLR graph)           ║
║  NEXT_NODE:    <int>  (must be adjacent — no teleporting)    ║
║  REASON:       ≤ 20 words                                    ║
║  CONFIDENCE:   0.0 – 1.0                                     ║
╚══════════════════════════════════════════════════════════════╝
```

The road graph constraint is **the hardest part**. An LLM that has never seen Bangalore's road topology must learn, through RL, to only propose valid neighbor hops. Every teleport attempt is caught and penalized. Over training, the oracle learns to *stay on the map*.

```
Bangalore Road Graph (simplified):

[Whitefield] ──── [Marathahalli] ──── [Sarjapur]
      │                  │                 │
 [Varthur]         [Bellandur]       [HSR Layout]
      │                  │                 │
 [KR Puram] ──── [Indiranagar] ──── [Koramangala]

✅ Marathahalli → Bellandur   (adjacent, valid)
✅ Indiranagar → Koramangala  (adjacent, valid)
❌ Whitefield → Koramangala   (not adjacent, TELEPORT FLAG)
```
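
The anti-teleport check itself reduces to a set-membership test on an adjacency list. A sketch over the simplified graph above (the real RoadCore works on integer OSM node IDs, not area names):

```python
# Minimal sketch of teleport detection on the simplified graph above;
# the real RoadCore uses integer OSM node IDs, not area names.
ADJACENCY = {
    "Whitefield":   {"Marathahalli", "Varthur"},
    "Marathahalli": {"Whitefield", "Sarjapur", "Bellandur"},
    "Sarjapur":     {"Marathahalli", "HSR Layout"},
    "Varthur":      {"Whitefield", "KR Puram"},
    "Bellandur":    {"Marathahalli", "Indiranagar"},
    "HSR Layout":   {"Sarjapur", "Koramangala"},
    "KR Puram":     {"Varthur", "Indiranagar"},
    "Indiranagar":  {"KR Puram", "Bellandur", "Koramangala"},
    "Koramangala":  {"Indiranagar", "HSR Layout"},
}

def teleport_detected(current: str, proposed: str) -> bool:
    """True when the proposed hop is not an edge of the road graph."""
    return proposed not in ADJACENCY.get(current, set())
```

Because the graph is data, not model weights, the check is deterministic: the three examples above come out valid, valid, and flagged, every time.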

---

<div align="center">

## 🌍 Environment Design: OpenEnv-First

### *The environment is a first-class citizen, not an afterthought.*

</div>

### What the model sees (Observation space)

Every step, the agent receives a structured text prompt + JSON state:

```json
{
  "tick": 47,
  "stations": {
    "BLR-07": { "queue_length": 14, "feeder_stress": 0.92, "active_chargers": 8 },
    "BLR-12": { "queue_length": 3, "feeder_stress": 0.41, "active_chargers": 2 },
    "..."
  },
  "renewable_score": 0.73,
  "critical_evs": [
    { "vehicle_id": "KA01AB1234", "battery_pct": 3, "station": "BLR-07" }
  ],
  "road_graph": { "current_node": 1042, "neighbors": [1043, 1101, 998] }
}
```

### Episode lifecycle

```
RESET ──▶ tick=0, random seed, grid initialized
   │
   ▼
STEP 1 ──▶ agent acts ──▶ simulator ticks ──▶ reward computed
   │
   ▼
STEP 2 … STEP N   (long-horizon, mistakes compound)
   │
   ▼
DONE ──▶ episode summary, KPI aggregation, artifacts written
```

The key design insight: **this is not single-shot**. The agent must maintain strategy across many steps. A greedy "always route to the closest station" policy looks fine early and catastrophically fails at peak. The oracle learns to plan.

### Core API (OpenEnv-compatible)

```
POST /reset   → initial observation + episode seed
POST /step    → action → next obs + reward breakdown + done flag
GET  /state   → current full simulator state (JSON)
GET  /schema  → action schema definition
GET  /health  → liveness probe
```
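
A client-side episode loop over these routes can be sketched as follows. The payload field names (`action`, `reward`, `observation`, `done`) are illustrative assumptions, and `post` is any JSON-returning transport — a thin wrapper over `requests.post`, or a stub for testing:

```python
# Hypothetical client loop over the /reset and /step routes above.
# `post` is any callable returning parsed JSON (e.g. wraps requests.post);
# the payload field names here are illustrative assumptions.
def run_episode(post, policy, base_url="http://localhost:8000", max_steps=50):
    obs = post(f"{base_url}/reset")          # initial observation + seed
    total_reward, step = 0.0, 0
    while step < max_steps:
        action_text = policy(obs)            # LLM (or heuristic) emits action text
        result = post(f"{base_url}/step", json={"action": action_text})
        total_reward += result["reward"]
        obs = result["observation"]
        if result["done"]:
            break
        step += 1
    return total_reward
```

The same loop serves both interactive judging (a human-readable policy) and evaluation harnesses (a scripted baseline or the trained oracle), since only `policy` changes.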

---

<div align="center">

## 💰 Reward Design: Verifiable, Multi-Component, Anti-Hack

### *The reward is where RL wins or loses. We built it to win.*

</div>

> **If your reward is fuzzy, your agent learns to game it. If it's crisp and hard to fake, RL works.**

```
Total Reward = Σ weighted components (computed by verifier, not LLM)

┌──────────────────┬──────────────────┬────────────────────┐
│ Component        │ Signal Direction │ What it measures   │
├──────────────────┼──────────────────┼────────────────────┤
│ wait             │ ⬇ penalize       │ queue wait time    │
│ grid_stress      │ ⬇ penalize       │ feeder overload    │
│ peak             │ ⬇ penalize       │ peak violations    │
│ renewable        │ ⬆ reward         │ clean window usage │
│ urgency          │ ⬇ penalize       │ critical EV defers │
│ format_valid     │ ⬆ shaping        │ parseable output   │
│ anti_cheat       │ ⬇ hard penalty   │ impossible actions │
└──────────────────┴──────────────────┴────────────────────┘
```

### Anti-hack flags (the hard constraints)

| Flag | Trigger | Penalty |
|---|---|---|
| `teleport_detected` | NEXT_NODE not adjacent to CURRENT_NODE | Hard negative |
| `invalid_station` | STATION not in BLR-01..BLR-25 | Hard negative |
| `critical_deferred` | Critical EV (≤5% battery) given DEFER > 0 | Hard negative |
| `rate_enum_invalid` | CHARGE_RATE outside enum | Format penalty |
| `confidence_oob` | CONFIDENCE outside [0.0, 1.0] | Format penalty |

The anti-hack layer is what separates *learning the task* from *gaming the reward*. Every flag is logged per step, per episode, so you can audit exactly where the agent was trying to cheat.
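
Putting the table together, the verifier's total is just a weighted sum over the per-component breakdown. A sketch with made-up example weights (the repo's tuned values are not shown here; only the component names and sign conventions come from the table above):

```python
# Illustrative sketch of the weighted-sum scoring described above. The
# component names match the table; these weights are made-up examples,
# not the repo's tuned values. Signs encode the table's arrows.
WEIGHTS = {
    "wait": -1.0, "grid_stress": -1.0, "peak": -2.0,
    "renewable": +1.0, "urgency": -5.0, "format_valid": +0.2,
    "anti_cheat": -10.0,
}

def total_reward(components: dict[str, float]) -> float:
    """Σ weight_i * component_i over the verifier's breakdown."""
    return sum(WEIGHTS[name] * value for name, value in components.items())
```

Keeping the breakdown as a dict is what makes the per-step audit logs possible: each addend can be emitted alongside the total.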

---

<div align="center">

## 🧠 Training Pipeline

### *GRPO + Unsloth + QLoRA on Qwen2.5-3B*

</div>

### Why GRPO?

Standard PPO requires a value network — an entire extra model to estimate baselines. **GRPO (Group Relative Policy Optimization)** computes baselines *within a group of rollouts from the same prompt*. This means:

- ✅ No separate critic network
- ✅ Lower memory footprint
- ✅ Works beautifully with verifiable reward functions
- ✅ More rollouts per GPU-hour → faster iteration

```
GRPO Training Loop (per prompt batch):

Prompt P ──▶ sample K completions ──▶ [A₁, A₂, A₃, A₄]
                                        │    │    │    │
                                      r=0.8 r=0.2 r=0.9 r=0.1   (verifier rewards)
                                                 │
                                                 ▼
                                relative_reward_i = r_i − mean(r)
                                                 │
                                                 ▼
                                     policy gradient update
                            (push A₁, A₃ up; push A₂, A₄ down)
```
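
The group-relative baseline in the diagram is a few lines of arithmetic, shown here in the doc's mean-only form (TRL's GRPOTrainer additionally normalizes by the group's standard deviation):

```python
# The group-relative baseline from the diagram above: each rollout's
# advantage is its verifier reward minus the group mean. (TRL's
# GRPOTrainer also divides by the group std; omitted here for clarity.)
def group_relative_advantages(rewards: list[float]) -> list[float]:
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]
```

With the diagram's rewards [0.8, 0.2, 0.9, 0.1], the baseline is 0.5, so A₁ and A₃ get positive advantages (pushed up) and A₂ and A₄ get negative ones (pushed down) — exactly the update the diagram describes.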

### Model + adapter stack

```
Base model:  Qwen2.5-3B   (3 billion params, fits in Colab T4)
Adapter:     QLoRA        (4-bit quantized base + 16-bit LoRA)
Runtime:     Unsloth      (2×+ throughput vs vanilla HF)
Trainer:     TRL GRPOTrainer
Logging:     TensorBoard  (ev_oracle_grpo_road/)
```

### The winning tip (from hard experience)

```
❌ Wrong approach: one 12-hour mega-run, pray it converges
✅ Right approach: many 45-min runs with reward component iteration

What to iterate:
1. reward component weights
2. anti-hack thresholds
3. scenario curriculum (easy → hard seeds)
4. rollout batch size (throughput dominates RL wall-clock time)
```

### Training in 3 commands

```bash
# 1. Clone and set up
git clone https://github.com/NITISH-R-G/ev-grid-oracle
pip install openenv-core trl unsloth

# 2. Open the notebook
jupyter notebook training/train_grpo.ipynb

# 3. Export evidence plots after training
python tools/export_grpo_tensorboard_plots.py \
    --logdir ev_oracle_grpo_road \
    --out-dir artifacts
```

---

<div align="center">

## 📊 Evidence: Results That Judges Can Audit

### *Every number is committed. Every plot is reproducible.*

</div>

### Paired evaluation results

> `paired_same_world=true, episodes=72` — baseline and oracle ran on *identical seeds*. No cherry-picking.

```
┌───────────────────────────┬──────────────┬──────────────┐
│ KPI                       │ Baseline     │ Oracle       │
├───────────────────────────┼──────────────┼──────────────┤
│ avg_wait_minutes          │ ---          │ 0.2939       │
│ grid_stress_events        │ ---          │ 10.3194      │
│ peak_violations           │ ---          │ 5.6528       │
│ renewable_mean            │ ---          │ 0.3625       │
│ critical_deferred         │ ---          │ 0  ✅        │
│ anti_cheat_steps          │ ---          │ 2.6389       │
└───────────────────────────┴──────────────┴──────────────┘

Key: critical_deferred = 0 means no safety failures across 72 episodes.
```

### Fair eval (n=25 episodes, Wilson + McNemar)

From `artifacts/fair_eval_results.json`:

- Wilson confidence intervals computed for all binary outcomes
- McNemar paired test p-values committed for auditable significance testing
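
The Wilson score interval is a standard closed-form, reproduced here for a hypothetical `k` oracle-wins out of `n` paired episodes at 95% confidence (z = 1.96); the committed results use the same formula:

```python
import math

# Standard Wilson score interval for a binomial proportion; `k` wins out
# of `n` episodes is a hypothetical example, not a committed result.
def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half
```

Unlike the naive normal interval, Wilson stays inside [0, 1] and behaves sensibly at small n — which matters for a 25-episode eval.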

---

<div align="center">

## 🖼️ Visualization Gallery

*Everything below is auto-generated, committed, and auditable.*

</div>

---

### 📋 One-Page Dashboard — The Full Picture

> *Six panels. One glance. All the signal you need.*



---

### 📊 Aggregate KPI Comparison — Baseline vs Oracle

> *Mean KPIs across all 72 paired episodes.*



---

### 📈 Per-Episode Trajectories — Wait, Peaks, Stress Over Time

> *Each line is one episode. Same seeds, both policies.*



---

### 📉 Paired Deltas — Oracle Minus Baseline

> *Negative delta on wait/stress = oracle wins. Positive on renewable = oracle wins.*



---

### 🧩 Reward Component Breakdown

> *Where is the reward coming from? Not a black box.*



```
Interpreting this chart:
┌──────────────────────────────────────────────────────┐
│ wait bar:        lower = oracle reduced queues       │
│ grid_stress bar: lower = fewer overload events       │
│ renewable bar:   higher = more clean energy used     │
│ urgency bar:     near-zero = no critical deferrals   │
│ anti_cheat bar:  near-zero = agent learned physics   │
└──────────────────────────────────────────────────────┘
```

---

### 📦 Distribution Boxplots — Variance Matters

> *Medians and spreads, not just means. Robust policies have tight boxes.*



---

### 🥊 Head-to-Head Win Rates

> *On what fraction of episodes does oracle beat baseline on each KPI?*



---

### 🎯 Paired Scatter — Oracle vs Baseline Wait Time

> *Points above the diagonal = oracle is worse. Points below = oracle wins.*



---

### 🗓️ Binary Timeline — When Is Baseline Struggling?

> *Where on the timeline does baseline fail most? That's where oracle improves the most.*



---

### 📐 Wilson Intervals — Uncertainty-Aware Binary Rates

> *Not just point estimates. Confidence intervals for every binary outcome.*



---

### 🔬 McNemar p-values — Statistical Significance

> *Paired hypothesis testing. If p < 0.05, the difference is real, not noise.*



---

### 📉 GRPO Training Curves

> *The two plots every judge looks for: reward goes up, loss goes down.*





---

### ✅ Fair Eval Summary Chart



---
|
| 628 |
+
|
| 629 |
+
<div align="center">
|
| 630 |
+
|
| 631 |
+
## 💥 Business Impact
|
| 632 |
+
|
| 633 |
+
### *This is not a toy. This is operational AI.*
|
| 634 |
+
|
| 635 |
+
</div>
|
| 636 |
+
|
| 637 |
+
```
|
| 638 |
+
Every 1 minute of average wait reduced across 847 peak-hour EVs
|
| 639 |
+
= 847 minutes of human time saved per peak window
|
| 640 |
+
= ~14 hours of productive time returned to Bangalore, daily.
|
| 641 |
+
|
| 642 |
+
Every peak violation avoided
|
| 643 |
+
= avoided SLA penalty + avoided feeder hardware stress.
|
| 644 |
+
|
| 645 |
+
Every renewable window captured
|
| 646 |
+
= direct carbon savings + lower marginal electricity cost.
|
| 647 |
+
```
|
| 648 |
+
|
| 649 |
+
| Impact Vector | Mechanism | Who benefits |
|
| 650 |
+
|---|---|---|
|
| 651 |
+
| **Lower wait time** | Load-shift to uncongested stations during demand spikes | EV drivers, fleet operators |
|
| 652 |
+
| **Fewer peak violations** | Proactive deferral before thermal limits hit | Grid operators, BESCOM |
|
| 653 |
+
| **More renewable usage** | Route slow-charging EVs into solar/wind windows | Environment, cost-payers |
|
| 654 |
+
| **Zero critical deferrals** | Hard constraint: never defer ≤5% battery | Safety, drivers |
|
| 655 |
+
| **Auditable decisions** | Structured actions + reward breakdown logged | Regulators, operators |
|
| 656 |
+
|
| 657 |
+
---
|
| 658 |
+
|
| 659 |
+
<div align="center">
|
| 660 |
+
|
| 661 |
+
## 🔬 Research Foundation
|
| 662 |
+
|
| 663 |
+
### *We read the papers. Then we implemented them.*
|
| 664 |
+
|
| 665 |
+
</div>
|
| 666 |
+
|
```
┌─────────────────────────────────────────────────────────────────┐
│ Paper                    │ What we took                         │
├─────────────────────────────────────────────────────────────────┤
│ DeepSeekMath (GRPO)      │ Group-relative baseline, no critic   │
│ arXiv:2402.03300         │ Memory-lean PPO variant for verif.   │
├─────────────────────────────────────────────────────────────────┤
│ QLoRA (Dettmers et al.)  │ 4-bit quantized base + 16-bit LoRA   │
│ arXiv:2305.14314         │ Fast iteration on Colab T4           │
├─────────────────────────────────────────────────────────────────┤
│ RLVR direction           │ Verifiable rewards reduce judge      │
│ arXiv:2601.18533         │ hacking, improve signal quality      │
├─────────────────────────────────────────────────────────────────┤
│ OpenEnv RFC-004          │ Composable reward rubrics,           │
│ (meta-pytorch/OpenEnv)   │ trajectory scoring, delayed rewards  │
└─────────────────────────────────────────────────────────────────┘
```

**The core insight we borrowed from all of them:**

> Sample multiple completions per prompt → score them with a verifiable function → update the policy to prefer what actually worked. Don't let the LLM grade its own homework.

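In TRL terms, that loop is a `GRPOTrainer` plus one or more reward callables: TRL passes each batch of sampled completions to the callable and expects one float back per completion. A minimal sketch of a verifiable reward in that shape (the JSON action schema and field names here are illustrative, not our exact ones, and we assume plain-text completions rather than chat-format message lists):

```python
import json

VALID_ACTIONS = {"CHARGE_NOW", "DEFER", "REROUTE"}

def grid_reward(completions, **kwargs):
    """Verifiable reward: score sampled completions with rules, not an LLM judge.
    Shaped like a TRL GRPOTrainer reward function (list[str] -> list[float])."""
    rewards = []
    for text in completions:
        score = 0.0
        try:
            action = json.loads(text)
            # +1 for emitting a parseable, schema-valid action
            if action.get("type") in VALID_ACTIONS:
                score += 1.0
            # Hard safety rule: heavily penalize deferring a near-empty battery
            if action.get("type") == "DEFER" and action.get("soc_pct", 100) <= 5:
                score -= 5.0
        except (json.JSONDecodeError, AttributeError):
            score -= 1.0  # unparseable output is penalized outright
        rewards.append(score)
    return rewards

print(grid_reward(['{"type": "CHARGE_NOW"}', 'not json']))
# → [1.0, -1.0]
```

Because the scorer is deterministic rules, the policy cannot sweet-talk its judge; any reward hacking has to go through the simulator itself.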
---

<div align="center">

## 🛠️ Tech Stack Deep Dive

</div>

```
┌───────────────────────────────────────────────────────────────┐
│                     EV GRID ORACLE STACK                      │
├────────────────┬──────────────────────────────────────────────┤
│ Layer          │ Technology                                   │
├────────────────┼──────────────────────────────────────────────┤
│ Interface      │ OpenEnv 0.2.3 (reset/step/state/schema)      │
│ API server     │ FastAPI (async, auto-docs, Pydantic)         │
│ Hosting        │ Hugging Face Spaces (Docker, port 8000)      │
│ Simulator      │ EVGridCore + RoadCore (pure Python)          │
│ Training       │ TRL GRPOTrainer + GRPOConfig                 │
│ Base model     │ Qwen2.5-3B (instruction-tuned)               │
│ Adapters       │ QLoRA (4-bit base, 16-bit adapters)          │
│ Runtime        │ Unsloth (2×+ throughput on T4)               │
│ Deep learning  │ PyTorch + Transformers                       │
│ Logging        │ TensorBoard (ev_oracle_grpo_road/)           │
│ Evaluation     │ scipy.stats (Wilson, McNemar)                │
│ Viz            │ matplotlib + seaborn (committed PNGs)        │
└────────────────┴──────────────────────────────────────────────┘
```

### Why each choice matters

**OpenEnv** — standardizes the environment interface so any judge can `POST /reset` and `POST /step` from their own machine without touching training code.

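Concretely, a judge's whole episode loop is plain HTTP. A stdlib-only client sketch (the Space hostname follows the usual `owner-space.hf.space` pattern and the payload field names are illustrative; the Space's auto-generated `/docs` page is the authoritative schema):

```python
import json
from urllib import request

BASE = "https://nitishrg15102007-ev-grid-oracle.hf.space"  # illustrative Space URL

def post(path, payload):
    """POST a JSON body to the environment and decode the JSON response."""
    req = request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

def make_step_payload(action_type, ev_id):
    """Body for POST /step (field names are illustrative placeholders)."""
    return {"action": {"type": action_type, "ev_id": ev_id}}

def run_episode(max_steps=48):
    """Drive one episode against the live Space (requires network access)."""
    result = post("/reset", {})                     # start a fresh episode
    for _ in range(max_steps):
        result = post("/step", make_step_payload("CHARGE_NOW", ev_id=0))
        if result.get("done"):                      # episode terminated
            break
    return result
```

Calling `run_episode()` from any machine exercises the environment end to end, with no training code in the loop.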
**TRL GRPO** — purpose-built for reinforcement learning with verifiable reward functions. The multi-sample-per-prompt structure is exactly what we need for reward comparison.

**Unsloth** — doubles throughput on the same GPU. More rollouts per hour = more RL iterations = better policy. Simple math.

**Paired evaluation** — using identical seeds for baseline and oracle means any difference in KPIs is attributable to the policy, not random episode variation.

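Both statistics behind this fit in a few lines. A sketch built from the textbook formulas rather than our exact `fair_eval.py` (Wilson gives a confidence interval on each success rate; McNemar tests only the paired episodes where baseline and oracle disagree):

```python
from math import comb, sqrt

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

def mcnemar_exact_p(b, c):
    """Exact McNemar test on discordant pairs: b = episodes only the baseline
    passed, c = episodes only the oracle passed. Two-sided binomial p-value
    under H0 that discordant outcomes split 50/50."""
    n = b + c
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)

lo, hi = wilson_ci(successes=42, n=50)   # e.g. oracle passes 42/50 episodes
print(f"95% Wilson CI: [{lo:.3f}, {hi:.3f}]")
print(f"McNemar exact p (b=2, c=14): {mcnemar_exact_p(2, 14):.4f}")
```

On seed-paired episodes, `b` and `c` fall straight out of the per-episode pass/fail flags, which is exactly why identical seeds matter.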
---

<div align="center">

## ⚠️ Critical Warning: LoRA Merging

</div>

> **Read this before you copy our training setup.**

```
┌──────────────────────────────────────────────────────────────┐
│                    ⚠️ LoRA/QLoRA WARNING                     │
│                                                              │
│  DO NOT naively upcast a 4-bit base to 16-bit and "merge"    │
│  at the end without the correct Unsloth path.                │
│                                                              │
│  What happens if you do it wrong:                            │
│  → Adapter weights applied to wrong quantization scale       │
│  → Model quality degrades silently (hard to detect)          │
│  → Your eval results are meaningless                         │
│                                                              │
│  What to do instead:                                         │
│  → Save adapters cleanly with Unsloth's save_pretrained      │
│  → Test post-training inference IMMEDIATELY after saving     │
│  → Verify on held-out prompts before running eval harness    │
└──────────────────────────────────────────────────────────────┘
```

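A sketch of what "test inference immediately" can look like in practice. Hedged heavily: the adapter path and prompt are placeholders, we assume Unsloth's `FastLanguageModel` loader can reopen a saved adapter directory on a GPU runtime, and the repetition check is only a cheap heuristic for the silent-degradation failure mode, not a full eval:

```python
def looks_degenerate(text, max_repeat=4):
    """Cheap smoke test for silent merge damage: badly merged models tend to
    loop tokens. Flags any word repeated max_repeat+ times in a row."""
    words, run = text.split(), 1
    for prev, cur in zip(words, words[1:]):
        run = run + 1 if cur == prev else 1
        if run >= max_repeat:
            return True
    return False

def smoke_test_adapters(adapter_dir="outputs/ev_oracle_lora"):  # placeholder path
    """Reload the just-saved adapters and generate on a held-out prompt."""
    from unsloth import FastLanguageModel  # lazy import: needs a GPU runtime
    model, tokenizer = FastLanguageModel.from_pretrained(adapter_dir)
    FastLanguageModel.for_inference(model)
    inputs = tokenizer("Station S3 at 92% load, EV at 40% SoC. Action?",
                       return_tensors="pt").to(model.device)
    text = tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0])
    assert not looks_degenerate(text), "merge/save likely corrupted the weights"
    return text
```

Running this once right after saving costs seconds and would have caught the silent-quality-loss scenario the box above warns about before any eval numbers were produced.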
---

<div align="center">

## 📁 Submission Bundle

### *Everything a judge needs, in one place.*

</div>

```
ev-grid-oracle/
├── 📄 README.md                 ← Start here. Judges table, quick links.
├── 📄 openenv.yaml              ← OpenEnv descriptor (discoverable)
├── 🖥️ server/
│   └── app.py                   ← FastAPI entrypoint (OpenEnv endpoints)
├── 🧠 training/
│   ├── train_grpo.ipynb         ← Full GRPO training notebook (Colab-ready)
│   ├── evaluate.py              ← Paired evaluation script
│   ├── fair_eval.py             ← Wilson + McNemar evaluation
│   └── make_plots.py            ← Artifact plot generation
├── 📊 artifacts/
│   ├── eval_dashboard_summary.png   ← Six-panel overview
│   ├── kpi_comparison.png           ← Baseline vs oracle KPIs
│   ├── eval_*.png                   ← Full plot suite
│   ├── grpo_loss.png                ← Training evidence
│   ├── grpo_reward.png              ← Training evidence
│   ├── eval_results.json            ← Raw numbers (auditable)
│   └── fair_eval_results.json       ← Wilson + McNemar results
├── 🛠️ tools/
│   └── export_grpo_tensorboard_plots.py
└── 📝 docs/
    ├── hf-mini-blog-ev-grid-oracle.md   ← This file
    ├── submission/
    │   ├── training-artifacts-and-logs.md
    │   └── youtube-under-2min-outline.md
    └── hackathon-official-resources.md
```

+
### Non-negotiables checklist (judges — look for these)
|
| 797 |
+
|
| 798 |
+
| Item | Where | Status |
|
| 799 |
+
|---|---|---|
|
| 800 |
+
| Live HF Space (env runs) | `README.md` → Quick links | ✅ |
|
| 801 |
+
| OpenEnv descriptor | `openenv.yaml` | ✅ |
|
| 802 |
+
| Training notebook (Colab) | `training/train_grpo.ipynb` | ✅ |
|
| 803 |
+
| GRPO reward curve (PNG) | `artifacts/grpo_reward.png` | ✅ |
|
| 804 |
+
| GRPO loss curve (PNG) | `artifacts/grpo_loss.png` | ✅ |
|
| 805 |
+
| Paired eval results (JSON) | `artifacts/eval_results.json` | ✅ |
|
| 806 |
+
| Wilson + McNemar (JSON) | `artifacts/fair_eval_results.json` | ✅ |
|
| 807 |
+
| Full plot suite (PNG) | `artifacts/*.png` | ✅ |
|
| 808 |
+
| Written submission | This file | ✅ |
|
| 809 |
+
| Demo video (< 2 min) | `docs/submission/youtube-*` | ✅ |
|
| 810 |
+
|
| 811 |
+
---
|
| 812 |
+
|
| 813 |
+
<div align="center">
|
| 814 |
+
|
| 815 |
+
## 🔗 Quick Links
|
| 816 |
+
|
| 817 |
+
| Resource | URL |
|
| 818 |
+
|---|---|
|
| 819 |
+
| 🤗 HF Space (live env) | [NITISHRG15102007/ev-grid-oracle](https://huggingface.co/spaces/NITISHRG15102007/ev-grid-oracle) |
|
| 820 |
+
| 📓 Training Notebook | [train_grpo.ipynb (Colab)](https://colab.research.google.com/github/NITISH-R-G/ev-grid-oracle/blob/main/training/train_grpo.ipynb) |
|
| 821 |
+
| 💻 GitHub Repo | [NITISH-R-G/ev-grid-oracle](https://github.com/NITISH-R-G/ev-grid-oracle) |
|
| 822 |
+
| 📋 This Blog (shareable) | [docs/hf-mini-blog-ev-grid-oracle.md](https://github.com/NITISH-R-G/ev-grid-oracle/blob/main/docs/hf-mini-blog-ev-grid-oracle.md) |
|
| 823 |
+
| 📧 Team Lead | [Nitish R.G. on LinkedIn](https://www.linkedin.com/in/nitish-r-g-15-10-2007-rgn/) |
|
| 824 |
+
|
| 825 |
+
---
|
| 826 |
+
|
| 827 |
+
<br/>
|
| 828 |
+
|
| 829 |
+
*Built by **Team Codestreak** · Bangalore, India · 2025*
|
| 830 |
+
|
| 831 |
+
*"The grid doesn't wait. Neither should your model."* ⚡
|
| 832 |
+
|
| 833 |
+
</div>
|