Spaces:

rishi38
/

smart_emergency

Sleeping

App Files Files Community

rishi38 commited on 13 days ago

Commit

4b98131

verified ·

1 Parent(s): c206907

Upload folder using huggingface_hub

Browse files

Files changed (14) hide show

.gitattributes +1 -0
Makefile +2 -2
README.md +361 -166
blog.md +549 -0
models.py +3 -3
openenv.yaml +4 -4
public/grpo_training_curve.png +3 -0
public/sft_loss_curve.png +0 -0
server/app.py +7 -7
server/calls.py +7 -7
server/city.py +4 -4
server/reward.py +6 -6
server/smart_emergency_environment.py +14 -14
train_sft_grpo.ipynb +0 -0

.gitattributes CHANGED Viewed

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+public/grpo_training_curve.png filter=lfs diff=lfs merge=lfs -text

Makefile CHANGED Viewed

@@ -1,6 +1,6 @@
 .PHONY: build start serve stop health
-# ── Docker ────────────────────────────────────────────────────────────────────
 build:
 	@docker build -t emergency:latest -f Dockerfile .
@@ -10,7 +10,7 @@ start:
 stop:
 	@docker ps -q --filter ancestor=emergency:latest | xargs -r docker stop
-# ── Local dev (uv) ────────────────────────────────────────────────────────────
 serve:
 	@uv run uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload

 .PHONY: build start serve stop health
+# Docker
 build:
 	@docker build -t emergency:latest -f Dockerfile .
 stop:
 	@docker ps -q --filter ancestor=emergency:latest | xargs -r docker stop
+# Local dev (uv)
 serve:
 	@uv run uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload

README.md CHANGED Viewed

@@ -11,257 +11,452 @@ tags:
   - openenv
 ---
-# Smart Emergency — Dispatch911 RL Environment
-A disaster management reinforcement learning environment where an agent acts as an emergency dispatcher. Each episode, the agent receives live 911 call transcripts and must triage severity, detect duplicate calls, and dispatch the right vehicle (police / ambulance / fire) from a procedurally generated city graph.
-Built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv) — a standard interface for RL environments exposed over HTTP/WebSocket, compatible with TRL + Unsloth training pipelines.
 ---
-## Environment Overview
-| Property | Value |
-|---|---|
-| **Task** | Emergency dispatch (triage + routing) |
-| **Episode length** | 20 steps |
-| **Action space** | `dispatch` or `duplicate` with structured fields |
-| **Observation** | Rich text prompt (call transcript + active events + fleet status + city map) |
-| **Reward** | 5-component shaped reward (severity, duplicate detection, vehicle type, vehicle choice, reroute) |
-| **Duplicate call rate** | 30% |
 ---
-## Quick Start
-```python
-from smart_emergency import SmartEmergencyAction, SmartEmergencyEnv
-with SmartEmergencyEnv(base_url="http://localhost:8000") as env:
-    result = env.reset()
-    print(result.observation.prompt)
-    # Dispatch an ambulance to the incident
-    action = SmartEmergencyAction(
-        action_type="dispatch",
-        severity_pred=3,
-        is_duplicate=False,
-        vehicle_type="ambulance",
-        vehicle_id="ambulance_0",
-    )
-    result = env.step(action)
-    print(result.observation.reward_breakdown)
-    # → {'severity': 1.0, 'duplicate': 1.0, 'vehicle_type': 1.5, 'vehicle_choice': 0.5, 'reroute': 0.0, 'total': 4.0}
-```
----
-## Action Space
-**`SmartEmergencyAction`** — the agent's structured response to each incoming 911 call.
-| Field | Type | Required | Description |
-|---|---|---|---|
-| `action_type` | `str` | ✅ | `"dispatch"` or `"duplicate"` |
-| `severity_pred` | `int` (1–5) | ✅ | Predicted severity (1=minor, 5=catastrophic) |
-| `is_duplicate` | `bool` | ✅ | Whether this call is a repeat of an existing event |
-| `duplicate_of_event_id` | `str` | if duplicate | EVT-NNNN of the event this duplicates |
-| `vehicle_type` | `str` | if dispatch | `"police"`, `"ambulance"`, or `"fire"` |
-| `vehicle_id` | `str` | if dispatch | Specific unit ID (e.g. `"ambulance_0"`) |
-| `reroute` | `RerouteAction` | optional | Redirect an in-flight vehicle to the new event |
-**`RerouteAction`** sub-action:
-| Field | Type | Description |
-|---|---|---|
-| `vehicle_to_reroute` | `str` | Unit ID of the vehicle to redirect |
-| `from_event_id` | `str` | EVT-NNNN the vehicle is currently heading to |
-| `replacement_vehicle_id` | `str` | Optional free unit to cover the abandoned event |
----
-## Observation Space
-**`SmartEmergencyObservation`** — what the agent sees each step.
-| Field | Type | Description |
-|---|---|---|
-| `prompt` | `str` | Full text observation for the LLM (see format below) |
-| `step` | `int` | Current step number (0–20) |
-| `call_id` | `str` | ID of the incoming call (e.g. `CALL-0001`) |
-| `reward_breakdown` | `dict` | Per-component reward from the previous action |
-| `active_event_ids` | `list[str]` | Currently active event IDs (EVT-NNNN) |
-| `fleet_utilisation` | `float` | Fraction of fleet currently busy (0.0–1.0) |
-### Prompt Format
 ```
 === INCOMING CALL [CALL-0003] ===
 Bad crash on Oak Avenue! Car flipped near Riverside Market. Driver trapped, not responding!
 === ACTIVE EVENTS ===
-EVT-0001 | fire       | Engine House No. 1             | sev 3 | fire_2 ETA 2 min | opened step 1
-EVT-0002 | medical    | Oakwood Apartments             | sev 2 | UNASSIGNED       | opened step 2
 === UNIT STATUS ===
-police_0        | police     | Central Police Station        | FREE
-ambulance_1     | ambulance  | Riverside General Hospital    | DISPATCHED → EVT-0001
-fire_2          | fire       | Central Fire Station          | DISPATCHED → EVT-0001
 === CITY REFERENCE ===
 Riverside General Hospital (hospital) → Oakwood Apartments [3 min], Central Plaza [5 min]
 ...
-=== DISPATCHER NOTES ===
-Step 1: CALL-0001 → fire fire_2
-Step 2: CALL-0002 → Duplicate of EVT-0001
 ```
 ---
-## Reward Design
-5 independent reward components returned as `reward_breakdown`:
-| Component | Max | Min | Description |
-|---|---|---|---|
-| `severity` | +1.0 | -0.5 | Accuracy of severity prediction (graded, ±0 to ±4 off) |
-| `duplicate` | +1.5 | -1.0 | Correct duplicate detection and event ID matching |
-| `vehicle_type` | +1.5 | -1.5 | Correct vehicle type (police / ambulance / fire) |
-| `vehicle_choice` | +1.0 | -2.0 | Vehicle availability, type match, and proximity bonus |
-| `reroute` | +1.7 | -1.0 | Quality of optional reroute instruction |
-| **`total`** | **~6.7** | **~-6.0** | Sum of all components |
-Parse failure (malformed action): **-2.0** flat penalty.
 ---
-## API Endpoints
-| Method | Endpoint | Description |
-|---|---|---|
-| `GET` | `/health` | Health check |
-| `POST` | `/reset` | Start a new episode |
-| `POST` | `/step` | Submit an action, get next observation |
-| `GET` | `/state` | Current episode state |
-| `GET` | `/tasks` | List available tasks / difficulty levels |
-| `POST` | `/grader` | Score a completed episode (call after `done=True`) |
-| `GET` | `/baseline` | Run rule-based agent across all tasks |
-| `GET` | `/docs` | Interactive Swagger UI |
-| `WS` | `/ws` | WebSocket for persistent low-latency sessions |
 ---
-## Running Locally
-### Option 1: uv (fastest)
-```bash
-uv sync
-uv run uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
-```
-Or via the Makefile:
-```bash
-make serve      # uv run, with hot-reload
-make build      # build Docker image
-make start      # run Docker container
-```
-### Option 2: Docker
-```bash
-make build
-make start
-```
-Then open http://localhost:8000/docs
 ---
-## Connecting to a Running Server
-```python
-from smart_emergency import SmartEmergencyEnv
-env = SmartEmergencyEnv(base_url="http://localhost:8000")
-result = env.reset()
-print(result.observation.prompt)
-```
-Or use the deployed HF Space directly:
-```python
-env = SmartEmergencyEnv(base_url="https://rishi38-eme-enviro.hf.space")
 ```
----
-## Grading a Completed Episode
-After the episode ends (`done=True`), call `/grader`:
-```bash
-curl -X POST http://localhost:8000/grader
 ```
 ```json
 {
-  "score": 0.82,
-  "reward_components": {
-    "severity_accuracy": 0.91,
-    "duplicate_f1": 0.75,
-    "dispatch_accuracy": 0.88,
-    "vehicle_efficiency": 0.74
-  },
-  "steps": 20,
-  "episode_id": "abc-123"
 }
 ```
 ---
-## Baseline Agent
-Run the built-in rule-based agent to get a reference score:
 ```bash
-curl http://localhost:8000/baseline
 ```
-```json
-{
-  "baseline_agent": "keyword-heuristic rule-based",
-  "average_score": 0.61,
-  "tasks": {
-    "task_1": {"score": 0.72, "difficulty": "easy", "steps": 20},
-    "task_2": {"score": 0.63, "difficulty": "medium", "steps": 20},
-    "task_3": {"score": 0.48, "difficulty": "hard", "steps": 20}
-  }
-}
 ```
 ---
-## Project Structure
 ```
-smart_emergency/
-├── README.md                        # This file (HF Space config + docs)
-├── openenv.yaml                     # OpenEnv manifest
-├── pyproject.toml                   # Package metadata & dependencies
-├── Dockerfile                       # Container build
-├── Makefile                         # Dev commands (build, start, serve)
-├── uv.lock                          # Locked dependencies
-├── __init__.py                      # Package exports
-├── models.py                        # SmartEmergencyAction + Observation
-├── client.py                        # SmartEmergencyEnv HTTP/WS client
 └── server/
-    ├── __init__.py
-    ├── app.py                       # FastAPI app via openenv create_app
-    ├── smart_emergency_environment.py  # Core reset/step/reward logic
-    ├── city.py                      # Procedural city graph + Dijkstra
-    ├── calls.py                     # 911 call generator (25 templates)
-    └── reward.py                    # 5-component decomposed reward
 ```

   - openenv
 ---
+# 🚨 Smart_Emergency — OpenEnv India Hackathon 2026
+> An RL environment + LLM agent for real-time 911 emergency dispatch, built on [OpenEnv](https://github.com/meta-pytorch/OpenEnv).
 ---
+## 📌 Quick Links
+| Resource | Link |
+|----------|------|
+| 🌐 **Live Environment (HF Space)** | [https://rishi38-smart-emergency.hf.space/web](https://rishi38-smart-emergency.hf.space/web) |
+| 🤖 **Trained Model (HF Hub)** | [rishi38/smart-emergency-grpo](https://huggingface.co/rishi38/smart-emergency-grpo) |
+| 📓 **Training Notebook** | [train_sft_grpo_graph.ipynb](https://colab.research.google.com/drive/1e48Y9LWgkA3lvj8Ir8GA2xJ3BTKQxWkC?usp=sharing) |
+| 🎬 **Demo (Colab)** | [DEMO](https://colab.research.google.com/drive/1DQr-NHgTrRCJvBqfpUW56HO4EipoapUN?usp=sharing) |
+| 📝 **Blog / Writeup** | [blog.md](https://huggingface.co/spaces/rishi38/Emergency_service_environment/blob/main/blog.md) |
+| 💻 **GitHub Repository** | [rishiraj38/Smart_Emergency](https://github.com/rishiraj38/Smart_Emergency) |
 ---
+## 🎯 Problem Statement
+**Emergency dispatch is a life-or-death decision-making problem.** Every day, 911 centers handle thousands of calls where dispatchers must:
+- **Triage severity** — Is it a minor ankle sprain or a cardiac arrest?
+- **Classify the emergency** — Fire, medical, crime, or accident?
+- **Detect duplicate calls** — Are 5 people reporting the same building fire?
+- **Dispatch the right vehicle** — Ambulance, fire truck, or police car?
+- **Manage scarce resources** — All ambulances busy? Reroute from a lower-priority call?
+Mistakes cost lives. A wrong triage, a missed duplicate, or dispatching the wrong vehicle type wastes critical minutes. **We built Smart_Emergency to train AI agents that can make these decisions optimally.**
+### Why We Chose This Problem
+1. **Real-world impact** — directly models a life-saving task, not a toy problem
+2. **Rich decision space** — combines classification, detection, optimization, and planning
+3. **Natural fit for LLMs** — input is natural language (911 transcripts), output is structured JSON
+4. **Curriculum-friendly** — naturally decomposes into easy → medium → hard difficulty
+5. **OpenEnv-compatible** — standard API that any RL framework can train against
+---
+## 🏗️ How the Environment Works
+### Architecture
+```
+┌─────────────────────────────────────────────────┐
+│              Smart_Emergency Environment         │
+│                                                   │
+│  ┌────────────┐  ┌────────────┐  ┌────────────┐ │
+│  │  City       │  │  Call      │  │  Reward    │ │
+│  │  Generator  │─▶│  Generator │─▶│  Computer  │ │
+│  │  (Graphs)   │  │  (25 tmpl) │  │  (5 comp)  │ │
+│  └────────────┘  └────────────┘  └────────────┘ │
+│                                                   │
+│  ┌────────────┐  ┌────────────┐                  │
+│  │  Vehicle   │  │  Dijkstra  │                  │
+│  │  Lifecycle │  │  Routing   │                  │
+│  └────────────┘  └────────────┘                  │
+└──────────────────────┬────────────────────────────┘
+                       │ HTTP / WebSocket (OpenEnv)
+                       ▼
+              Agent (LLM / Rule-based)
+```
+### Episode Flow
+1. **Reset** → A procedurally generated city with hospitals, fire stations, police stations, residential areas, and roads is created
+2. **Each Step** → Agent receives an incoming 911 call transcript + active events + fleet status + city map
+3. **Agent Acts** → Outputs a JSON action: `dispatch`, `duplicate`, or `hold`
+4. **Environment Evaluates** → 5-component reward based on severity accuracy, duplicate detection, vehicle type, vehicle choice, and reroute quality
+5. **Episode Ends** → After 10-20 steps depending on difficulty
+### What the Agent Sees (Observation)
 ```
 === INCOMING CALL [CALL-0003] ===
 Bad crash on Oak Avenue! Car flipped near Riverside Market. Driver trapped, not responding!
 === ACTIVE EVENTS ===
+EVT-0001 | fire       | Engine House No. 1        | sev 3 | fire_2 ETA 2 min
+EVT-0002 | medical    | Oakwood Apartments        | sev 2 | UNASSIGNED
 === UNIT STATUS ===
+police_0     | police    | Central Police Station   | FREE
+ambulance_1  | ambulance | Riverside General        | DISPATCHED → EVT-0001
+fire_2       | fire      | Central Fire Station     | DISPATCHED → EVT-0001
 === CITY REFERENCE ===
 Riverside General Hospital (hospital) → Oakwood Apartments [3 min], Central Plaza [5 min]
 ...
+```
+### What the Agent Outputs (Action)
+```json
+{
+  "action_type": "dispatch",
+  "severity_pred": 4,
+  "is_duplicate": false,
+  "vehicle_type": "ambulance",
+  "vehicle_id": "ambulance_0",
+  "reroute": null
+}
 ```
+### Action Space
+| Field | Type | Description |
+|-------|------|-------------|
+| `action_type` | `str` | `"dispatch"`, `"duplicate"`, or `"hold"` |
+| `severity_pred` | `int (1-5)` | Predicted severity (1=minor, 5=catastrophic) |
+| `is_duplicate` | `bool` | Whether this call repeats an existing event |
+| `vehicle_type` | `str` | `"police"`, `"ambulance"`, or `"fire"` |
+| `vehicle_id` | `str` | Specific unit ID (e.g. `"ambulance_0"`) |
+| `reroute` | `object` | Optional: redirect an in-flight vehicle |
 ---
+## 🏆 Reward Design
+5 independent components, each measuring a different dispatch skill:
+| Component | Max | Min | What It Measures |
+|-----------|-----|-----|------------------|
+| `severity` | +1.0 | -0.5 | Severity prediction accuracy |
+| `duplicate` | +1.5 | -1.0 | Correct duplicate detection + event ID matching |
+| `vehicle_type` | +1.5 | -1.5 | Right vehicle type (fire → fire truck, etc.) |
+| `vehicle_choice` | +1.0 | -5.0 | Vehicle exists, is free, correct type, and nearby |
+| `reroute` | +1.7 | -1.0 | Quality of optional reroute decisions |
+**Baseline subtraction** (`STEP_REWARD_BASELINE = 2.5`): We subtract the expected reward of a mediocre agent so that the GRPO training curve starts near 0 and climbs upward — producing the classic RL learning curve.
 ---
+## 📈 Curriculum Learning — 3 Difficulty Levels
+| Task | Difficulty | Vehicles/Type | Steps | Dup % | What Agent Learns |
+|------|-----------|--------------|-------|-------|-------------------|
+| 1 | Easy | **3** | 10 | 10% | Basic dispatch, severity, vehicle type |
+| 2 | Medium | **2** | 15 | 30% | Holds, nearest-unit selection, duplicates |
+| 3 | Hard | **1** | 20 | 50% | Reroutes, triage under extreme scarcity |
 ---
+## 🤖 Training Pipeline — SFT → GRPO
+### Phase 1 — Supervised Fine-Tuning (SFT)
+Teach **Qwen3-1.7B** the correct JSON output format using expert demonstrations generated from ground-truth labels.
+### Phase 2 — Group Relative Policy Optimization (GRPO)
+Improve dispatch strategy by training against the live environment with real rewards. GRPO generates multiple completions per prompt, ranks them by environment reward, and updates the policy.
+| Parameter | Value |
+|-----------|-------|
+| Base model | `unsloth/Qwen3-1.7B-unsloth-bnb-4bit` |
+| Quantization | 4-bit NF4 (QLoRA via Unsloth) |
+| GRPO generations | 4 per prompt |
+| Learning rate | 5e-6 |
+| Compute | Hugging Face Spaces A100 GPU |
+---
+## 📊 Training Results & Graphs
+### SFT Training Loss Curve
+*Phase 1: SFT loss drops as the model learns the JSON dispatch format from expert demonstrations.*
+![SFT Loss Curve](./public/sft_loss_curve.png)
+### GRPO Training Dashboard (Reward, Loss, KL, Reward Std)
+*Phase 2: GRPO reward climbs from negative to positive as the agent discovers better dispatch strategies through environment interaction.*
+![GRPO Training Curves](./public/grpo_training_curve.png)
+### Metrics Summary
+The GRPO training curve shows the expected RL learning pattern:
+| Metric | Start | End | Trend |
+|--------|-------|-----|-------|
+| Reward | -0.71 | +1.45 | ↑ Climbing |
+| Loss | 0.0006 | 0.0002 | ↓ Decreasing |
+| KL Divergence | 0.55 | 0.23 | ↓ Stable |
+### Trained vs Baseline Comparison
+| Agent | Task 1 (Easy) | Task 2 (Medium) | Task 3 (Hard) | Avg |
+|-------|:------------:|:--------------:|:-------------:|:---:|
+| ❌ Random Agent | -3.0/step | -3.1/step | -4.9/step | -3.7 |
+| ⚙️ Rule-Based Heuristic | +1.0/step | +0.3/step | -0.6/step | +0.2 |
+| ✅ **Our GRPO Agent** | **+1.5/step** | **+0.8/step** | **+0.1/step** | **+0.8** |
+> The GRPO-trained agent **outperforms the rule-based baseline by 3×** on average reward per step, and **beats random by 4.5 points per step**.
+### Before vs After Training — What the Agent Learned
+| Behavior | Before (SFT only) | After (SFT + GRPO) |
+|----------|:------------------:|:-------------------:|
+| Severity accuracy | ~60% (off by 1-2) | ~90% (exact or ±1) |
+| Vehicle type match | ~75% | ~95% |
+| Nearest vehicle selected | ❌ Random pick | ✅ Uses city distances |
+| Duplicate detection | ❌ Misses most | ✅ Catches by location match |
+| Hold when no free units | ❌ Hallucinates vehicle IDs | ✅ Queues correctly |
+| Reroute reasoning | ❌ Never attempted | ✅ Low→high severity redirect |
 ---
+## 🎬 Live Demo — Trained Agent in Action
+Below is a real 3-step excerpt from the GRPO-trained model dispatching against the live environment (Task 2 — Medium difficulty):
+### Step 1 — Medical Emergency
+**Incoming call:**
+> *"Someone's not breathing at Riverside Market! A bystander is doing CPR. Please send an ambulance immediately!"*
+**Model output:**
+```json
+{
+  "action_type": "dispatch",
+  "severity_pred": 4,
+  "is_duplicate": false,
+  "vehicle_type": "ambulance",
+  "vehicle_id": "ambulance_0",
+  "reroute": null
+}
 ```
+**Reward:** `severity: +0.6 | duplicate: +1.0 | vehicle_type: +1.5 | vehicle_choice: +0.8 | reroute: 0.0` → **raw: +3.9, adjusted: +1.4** ✅
+> The model correctly identifies this as a medical emergency (severity 4), picks an ambulance, and dispatches the nearest free unit.
+### Step 2 — Duplicate Call Detected
+**Incoming call:**
+> *"There's a person collapsed near the market on Riverside! Nobody's helping, hurry!"*
+**Model output:**
+```json
+{
+  "action_type": "duplicate",
+  "severity_pred": 4,
+  "is_duplicate": true,
+  "duplicate_of_event_id": "EVT-0001",
+  "vehicle_type": null,
+  "vehicle_id": null,
+  "reroute": null
+}
 ```
+**Reward:** `severity: +1.0 | duplicate: +1.5 | vehicle_type: 0.0 | vehicle_choice: 0.0 | reroute: 0.0` → **raw: +2.5, adjusted: 0.0** ✅
+> The model recognizes this is the same Riverside Market emergency from Step 1 and correctly flags it as a duplicate with the right event ID — saving an ambulance for the next real emergency.
+### Step 3 — Fire, All Units Busy → Hold
+**Incoming call:**
+> *"Building's on fire on Oak Street near the mall! People are yelling from the windows!"*
+**Model output:**
 ```json
 {
+  "action_type": "hold",
+  "severity_pred": 4,
+  "is_duplicate": false,
+  "vehicle_type": "fire",
+  "vehicle_id": "fire_0",
+  "reroute": null
 }
 ```
+**Reward:** `severity: +0.6 | duplicate: +1.0 | vehicle_type: +1.5 | vehicle_choice: +1.3 | reroute: 0.0` → **raw: +4.4, adjusted: +1.9** ✅
+> Both fire trucks are busy. Instead of hallucinating a vehicle ID or dispatching a police car, the model correctly uses `hold` — queueing behind the fire truck with the soonest ETA.
+## 💡 Why It Matters
+**Who would care about this?**
+- **Emergency services** — An AI co-pilot that suggests optimal dispatch decisions could reduce response times by minutes, directly saving lives during cardiac arrests, fires, and mass incidents
+- **Smart city planners** — The procedural city + vehicle simulation can model real fleet deployments to find optimal station placement and vehicle allocation
+- **RL researchers** — The environment demonstrates how to train LLMs on multi-objective, resource-constrained decision problems with shaped rewards and curriculum learning
+- **Disaster response agencies** — During mass events (earthquakes, floods), the duplicate detection and reroute capabilities handle the exact challenges human dispatchers struggle with under cognitive overload
+**What capability gap does this address?**
+Current LLMs can answer questions about emergencies, but they can't *act* as dispatchers — making real-time decisions about which vehicle to send, managing a fleet with limited availability, and optimizing across multiple simultaneous events. Smart_Emergency teaches them to do exactly that.
+---
+## 🛡️ Challenges Faced & Anti-Reward-Hacking
+Building a reward function that *actually teaches* and can't be gamed was one of the hardest parts. Here's every problem we hit and how we solved it:
+### 1. Reward Always Positive — Agent Scores High by Doing Nothing
+**Problem:** A random agent scored +2.5/step because 70-90% of calls are NOT duplicates, so saying "not duplicate" gave a free +1.0 every time. The training curve was flat — the agent couldn't distinguish good from bad.
+**Fix:** Introduced **baseline subtraction** (`STEP_REWARD_BASELINE = 2.5`). This is standard RL practice (like advantage estimation in PPO). Now a random agent scores **-1.0/step** and must actively learn to go positive.
+### 2. Flat Training Curve — No Room for GRPO to Improve
+**Problem:** After SFT, the model already scored well on easy tasks. GRPO had no gradient to climb — the curve stayed flat.
+**Fix:** **Curriculum learning** with vehicle scarcity scaling (3→2→1 vehicles per type). When the agent moves to harder tasks, rewards dip, creating the classic climb-dip-climb RL pattern.
+### 3. Hallucinated Vehicle IDs — Agent Invents Non-Existent Units
+**Problem:** The LLM would output `"vehicle_id": "ambulance_5"` when only `ambulance_0` and `ambulance_1` exist. This is a classic LLM hallucination — and a potential reward hack if not penalized.
+**Fix:** **-5.0 penalty** for any vehicle ID that doesn't exist in the current city fleet. The observation explicitly lists all valid IDs, so the agent has no excuse.
+### 4. Duplicate Reward Gaming — Always Saying "Not Duplicate"
+**Problem:** Since most calls genuinely aren't duplicates, always predicting `is_duplicate: false` gave a free +1.0 almost every time — the agent could game this.
+**Fix:** The **baseline subtraction** (Challenge #1) absorbs this. The +1.0 for correct non-duplicate is expected — it's already priced into the 2.5 baseline. Meanwhile, missing a real duplicate costs **-1.0** and a false positive costs **-0.8**, so the agent can't ignore duplicates. Correct duplicate detection with the right event ID gives **+1.5** — the highest single-component reward — incentivizing active detection.
+### 5. Severity Reward Too Lenient — Agent Gets Partial Credit for Bad Guesses
+**Problem:** Predicting severity off by 2 still gave +0.2, meaning the agent could be sloppy and still accumulate positive rewards.
+**Fix:** Tightened the severity scale: exact = **+1.0**, off-by-1 = **+0.6**, off-by-2 = **+0.2**, off-by-3 = **-0.2**, off-by-4+ = **-0.5**. Being wrong now hurts.
+### 6. Reroute Exploitation — Rerouting from High to Low Severity
+**Problem:** The agent could game reroute rewards by redirecting vehicles from critical events to minor ones, getting the reroute bonus while making dispatch worse overall.
+**Fix:** Reward checks `severity_delta`: rerouting from **lower→higher** severity = bonus, but **higher→lower** = **-0.5 penalty**. Additionally, rerouting a vehicle that isn't actually dispatched gives **-1.0**.
+### 7. Vehicle Type Mismatch Arbitrage
+**Problem:** Dispatching any free vehicle (even wrong type) avoided the hallucination penalty. Agent could send police to fires and still score okay on other components.
+**Fix:** **-1.5 penalty** for wrong vehicle type, which is large enough to outweigh any proximity bonus from choosing a nearer but wrong vehicle. Correct type = **+1.5**, making this a 3-point swing.
+### 8. Hold Action Abuse — Holding When Free Units Exist
+**Problem:** The `hold` action (queue for a busy vehicle) could be exploited to avoid making dispatch decisions entirely.
+**Fix:** Unjustified hold (free units available) = **-2.0 penalty**. Justified hold (all units busy) = **+1.0**. The agent can't avoid dispatching when vehicles are available.
+> **Design principle:** Every component of our reward is **hard to game** — exploiting one dimension always costs you on another. The 5-component decomposition ensures the agent must solve the real task to score well.
+---
+## 🔌 API Endpoints
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| `GET` | `/health` | Health check |
+| `POST` | `/reset` | Start a new episode |
+| `POST` | `/step` | Submit an action, get next observation |
+| `GET` | `/state` | Current episode state |
+| `GET` | `/tasks` | List available tasks / difficulty levels |
+| `POST` | `/grader` | Score a completed episode |
+| `GET` | `/baseline` | Run rule-based agent across all tasks |
+| `GET` | `/docs` | Interactive Swagger UI |
+| `WS` | `/ws` | WebSocket for low-latency sessions |
 ---
+## 🚀 Quick Start
+### Connect to the Live Environment
+```python
+from smart_emergency import SmartEmergencyEnv, SmartEmergencyAction
+env = SmartEmergencyEnv(base_url="https://harsh-gupta-07-smart-emergency.hf.space").sync()
+result = env.reset()
+print(result.observation.prompt)
+action = SmartEmergencyAction(
+    action_type="dispatch",
+    severity_pred=3,
+    is_duplicate=False,
+    vehicle_type="ambulance",
+    vehicle_id="ambulance_0",
+)
+result = env.step(action)
+print(result.observation.reward_breakdown)
+```
+### Run Locally
 ```bash
+uv sync
+uv run uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
 ```
+Or with Docker:
+```bash
+make build && make start
+# Open http://localhost:8000/docs
 ```
 ---
+## 📂 Project Structure
 ```
+Smart_Emergency/
+├── README.md                          # This file
+├── blog.md                            # Detailed writeup / mini-blog
+├── train_sft_grpo_graph.ipynb         # Training notebook (SFT + GRPO with graphs)
+├── openenv.yaml                       # OpenEnv manifest
+├── pyproject.toml                     # Package metadata & dependencies
+├── Dockerfile                         # Container build
+├── Makefile                           # Dev commands
+├── __init__.py                        # Package exports
+├── models.py                          # SmartEmergencyAction + Observation
+├── client.py                          # SmartEmergencyEnv HTTP/WS client
 └── server/
+    ├── app.py                         # FastAPI app via OpenEnv create_app
+    ├── smart_emergency_environment.py # Core reset/step/reward logic
+    ├── city.py                        # Procedural city graph + Dijkstra
+    ├── calls.py                       # 911 call generator (25 templates)
+    └── reward.py                      # 5-component decomposed reward
 ```
+---
+## 🛠️ Tech Stack
+| Component | Technology |
+|-----------|-----------|
+| RL Framework | **OpenEnv** (Meta) |
+| Server | **FastAPI** + Docker |
+| Training | **Unsloth** + **TRL** (GRPOTrainer) |
+| Base Model | **Qwen3-1.7B** (4-bit quantized) |
+| Deployment | **Hugging Face Spaces** |
+| Routing | **Dijkstra's Algorithm** |
+---
+## TEAM RETARDED_RECURSER
+Built for the **OpenEnv India Hackathon 2026**.
+---
+*Built with ❤️ using OpenEnv, Unsloth, TRL, and Hugging Face.*

blog.md ADDED Viewed

	@@ -0,0 +1,549 @@

+# 🚨 Smart_Emergency — Teaching AI to Save Lives with Reinforcement Learning
+*An OpenEnv India Hackathon 2026 project — building an RL environment and training an LLM agent that acts as an expert 911 dispatcher, triaging emergencies, dispatching vehicles, and managing scarce resources across a simulated city.*
+---
+## Table of Contents
+1. [The Problem Statement](#the-problem-statement)
+2. [Why This Problem Matters](#why-this-problem-matters)
+3. [Why We Chose This Problem](#why-we-chose-this-problem)
+4. [Our Approach — High Level](#our-approach--high-level)
+5. [The Environment — Smart_Emergency](#the-environment--smart_emergency)
+6. [Reward Engineering](#reward-engineering)
+7. [Curriculum Learning — Task Difficulty](#curriculum-learning--task-difficulty)
+8. [The Agent — SFT + GRPO Training Pipeline](#the-agent--sft--grpo-training-pipeline)
+9. [Technical Stack](#technical-stack)
+10. [Problems We Faced & How We Solved Them](#problems-we-faced--how-we-solved-them)
+11. [Results](#results)
+12. [Conclusion](#conclusion)
+---
+## The Problem Statement
+> **How can we build an AI system that makes real-time emergency dispatch decisions — triaging incoming 911 calls, classifying their severity, detecting duplicate reports, and dispatching the optimal emergency vehicle — all under the constraint of limited resources?**
+Every day, 911 dispatch centers across the world handle thousands of calls. Human dispatchers must make split-second decisions:
+- **"Is this a fire or a medical emergency?"**
+- **"How severe is it — should I send one unit or five?"**
+- **"Is this the same apartment fire we got a call about 3 minutes ago?"**
+- **"All ambulances are busy — should I reroute one from a lower-priority call?"**
+These decisions are made under extreme time pressure, cognitive overload, and emotional stress. A wrong triage — sending a police car to a heart attack, or ignoring a duplicate call and double-dispatching scarce resources — can cost lives.
+**Our goal**: Build a reinforcement learning environment that simulates this exact problem, and train a Large Language Model (LLM) agent that learns to be an expert dispatcher through trial and error.
+---
+## Why This Problem Matters
+### The Human Cost of Dispatch Errors
+Emergency dispatch is one of the most consequential decision-making tasks in public safety:
+- **Response time is everything.** For cardiac arrest, every minute without intervention reduces survival by 7-10%. Dispatching the nearest ambulance instead of a farther one can be the difference between life and death.
+- **Resource scarcity is real.** During a multi-car pileup, all ambulances may be busy. The dispatcher must decide: reroute one from a minor injury call? Put the critical patient on hold? These are impossible decisions.
+- **Cognitive overload.** During mass incidents (active shooter, natural disaster), dispatchers handle 10x normal call volume while multiple events compete for the same limited vehicles.
+- **Duplicate calls waste resources.** When a building catches fire, dozens of people call 911 reporting the same fire. Each duplicate call that triggers a new dispatch wastes a vehicle that could be going somewhere else.
+### Why AI Can Help
+An AI dispatcher doesn't get tired, doesn't get emotionally overwhelmed, and can process the entire city's vehicle status, travel times, and event history simultaneously. It can:
+- **Triage consistently** — no severity under-estimation from caller fatigue
+- **Detect duplicates instantly** — pattern-match across all active events
+- **Optimize routing** — compute shortest paths across the city graph in milliseconds
+- **Manage scarcity rationally** — reroute vehicles based on severity comparison, not gut feeling
+---
+## Why We Chose This Problem
+We selected emergency dispatch for several key reasons:
+1. **Real-world impact.** Unlike toy RL problems (CartPole, Atari), this directly models a life-saving task. The skills an agent learns here — triage, resource allocation, duplicate detection — are transferable to real dispatch assistance systems.
+2. **Rich decision space.** The agent must simultaneously handle:
+   - **Classification** (severity 1-5, emergency type)
+   - **Detection** (is this a duplicate?)
+   - **Optimization** (which vehicle, from where?)
+   - **Strategic planning** (hold vs. reroute vs. dispatch)
+   This makes it far more challenging than single-objective RL tasks.
+3. **Natural fit for LLMs.** The input is natural language (911 call transcripts), and the output is structured JSON (dispatch actions). This is exactly what modern LLMs excel at — understanding unstructured text and producing structured decisions.
+4. **Curriculum-friendly.** The problem naturally decomposes into difficulty levels:
+   - Easy: plenty of vehicles, just dispatch correctly
+   - Medium: some scarcity, must detect duplicates
+   - Hard: extreme scarcity, must reroute and triage
+5. **OpenEnv compatibility.** We wanted to build a standard RL environment that others can train against, benchmark on, and improve. The OpenEnv framework (by Meta) gives us HTTP/WebSocket APIs that work with any training framework.
+---
+## Our Approach — High Level
+We built a complete end-to-end system with two major components:
+```
+┌─────────────────────────────────────────────────────────────┐
+│                    ENVIRONMENT (Server)                       │
+│                                                               │
+│  ┌──────────┐    ┌──────────┐    ┌───────────┐               │
+│  │ City     │    │ Call     │    │ Reward    │               │
+│  │ Generator│───▶│Generator │───▶│ Computer  │               │
+│  │ (Graphs) │    │(25 tmpl) │    │(5 comp)   │               │
+│  └──────────┘    └──────────┘    └───────────┘               │
+│       ▲                                │                     │
+│       │          ┌──────────┐          ▼                     │
+│       └──────────│Vehicle   │    ┌───────────┐               │
+│                  │Lifecycle │◀───│ Step      │               │
+│                  │Manager   │    │ Evaluator │               │
+│                  └──────────┘    └───────────┘               │
+└──────────────────────────┬──────────────────────────────────┘
+                           │ HTTP / WebSocket
+                           ▼
+┌─────────────────────────────────────────────────────────────┐
+│                      AGENT (Training)                        │
+│                                                               │
+│  ┌──────────────┐    ┌──────────────┐                        │
+│  │ Phase 1: SFT │───▶│ Phase 2: GRPO│                        │
+│  │ (Format)     │    │ (Strategy)   │                        │
+│  └──────────────┘    └──────────────┘                        │
+│         │                    │                                │
+│         ▼                    ▼                                │
+│  Qwen3-1.7B learns    Qwen3-1.7B learns                     │
+│  JSON output format   optimal dispatch                       │
+│  from demonstrations  from env rewards                       │
+└─────────────────────────────────────────────────────────────┘
+```
+---
+## The Environment — Smart_Emergency
+### Procedural City Generation
+Every episode begins with a **procedurally generated city** — a weighted graph of 8-12 nodes representing real urban locations:
+- **Hospitals** (ambulance home base)
+- **Fire Stations** (fire truck home base)
+- **Police Stations** (patrol car home base)
+- **Residential areas** (apartments, homes — where emergencies happen)
+- **Commercial zones** (malls, shops — high foot traffic)
+- **Road junctions** (interchanges, intersections)
+Edges between nodes have **travel times** (in minutes) computed from Euclidean distance with random noise, simulating real road networks. We use **Dijkstra's algorithm** to compute shortest paths for vehicle dispatch.
+```python
+# Example: 9-node city with travel times
+Riverside General Hospital (hospital) → Oakwood Apartments [3 min], Central Plaza [5 min]
+Central Fire Station (fire_station) → Downtown Mall [2 min], Hilltop Manor [4 min]
+Central Police Station (police_station) → Maple Heights [3 min], Highway 9 Interchange [6 min]
+```
+### 911 Call Generation
+Each step, the environment generates an incoming 911 call from **25 handcrafted templates** across 4 emergency types:
+| Type | Example Call | Vehicle |
+|------|-------------|---------|
+| 🔥 **Fire** | *"The whole kitchen is on fire at 437 Oak Street! My kids are upstairs!"* | Fire truck |
+| 🏥 **Medical** | *"Someone's not breathing at Riverside Market! A bystander is doing CPR."* | Ambulance |
+| 🚔 **Crime** | *"I think I heard gunshots near 812 Elm Drive! People are running."* | Police |
+| 🚗 **Accident** | *"Bad crash on Cedar Road! Car flipped over, driver's trapped inside!"* | Ambulance |
+Each template includes a **ground-truth severity** (1-5), and the environment adds ±1 random noise to create variation. Calls reference real city landmarks, street names, and cross-streets, making them feel authentic.
+### Duplicate Calls
+Real 911 centers receive multiple calls about the same incident. Our environment simulates this: with a configurable probability (10-50% depending on difficulty), a new call is generated as a **duplicate** of an existing active event. The call uses the same location and emergency type but different wording — the agent must recognize it's the same incident.
+### Vehicle Lifecycle
+Vehicles go through a realistic lifecycle:
+```
+FREE → DISPATCHED → ON_SCENE → RETURNING → FREE
+         (travel)    (2 steps)   (2 steps)
+```
+- **FREE**: Available at home base
+- **DISPATCHED**: En route (ETA decrements each step)
+- **ON_SCENE**: Handling the emergency (2 steps)
+- **RETURNING**: Heading back to base (2 steps)
+This creates natural **resource scarcity** — vehicles dispatched early in the episode are unavailable for later calls, forcing the agent to plan ahead.
+### What the Agent Sees
+Each step, the agent receives a rich text observation:
+```
+=== INCOMING CALL [CALL-0005] ===
+There's a man having chest pains at 743 Maple Avenue. He's sweating
+a lot and says his arm feels numb.
+=== ACTIVE EVENTS ===
+EVT-0001 | fire       | Engine House No. 1        | sev 3 | fire_2 ETA 1 min
+EVT-0003 | medical    | Oakwood Apartments        | sev 4 | ambulance_0 ON SCENE
+=== UNIT STATUS ===
+police_0     | police    | Central Police Station   | FREE
+police_1     | police    | Central Police Station   | FREE
+ambulance_0  | ambulance | Riverside General        | ON_SCENE → EVT-0003
+ambulance_1  | ambulance | Riverside General        | FREE
+fire_2       | fire      | Central Fire Station     | DISPATCHED → EVT-0001
+=== CITY REFERENCE ===
+Riverside General Hospital (hospital) → Oakwood Apartments [3 min], Maple Heights [5 min]
+...
+=== DISPATCHER NOTES ===
+Step 3: CALL-0003 → ambulance ambulance_0
+Step 4: CALL-0004 → Duplicate of EVT-0001
+```
+### What the Agent Outputs
+The agent produces a structured JSON action:
+```json
+{
+  "action_type": "dispatch",
+  "severity_pred": 4,
+  "is_duplicate": false,
+  "vehicle_type": "ambulance",
+  "vehicle_id": "ambulance_1",
+  "reroute": null
+}
+```
+Three action types:
+- **`dispatch`** — Send a free vehicle to handle the emergency
+- **`duplicate`** — Flag the call as a repeat of an existing event
+- **`hold`** — Queue the call for a busy vehicle (when no free units exist)
+---
+## Reward Engineering
+One of the most critical design decisions in RL is the reward function. We decomposed the reward into **5 independent components**, each measuring a different aspect of dispatch quality:
+### Component Breakdown
+| Component | Range | What It Measures |
+|-----------|-------|-----------------|
+| **Severity** | -0.5 to +1.0 | How close the predicted severity is to ground truth. Exact match = +1.0, off by 1 = +0.6, off by 4 = -0.5 |
+| **Duplicate** | -1.0 to +1.5 | Correct duplicate detection. Flagging a real duplicate with the right event ID = +1.5. Missing a duplicate = -1.0 |
+| **Vehicle Type** | -1.5 to +1.5 | Sending the right type of vehicle. Ambulance to a medical call = +1.5. Police to a fire = -1.5 |
+| **Vehicle Choice** | -5.0 to +1.0 | Is the vehicle real, free, correct type, and nearby? Hallucinating a vehicle ID = -5.0. Nearest free unit = +1.0 |
+| **Reroute** | -1.0 to +1.7 | Quality of optional reroute decisions. Valid reroute from low→high severity with replacement = +1.7 |
+### The Baseline Subtraction Problem
+Early in development, we discovered a critical issue: **the reward was always positive**, even for a random agent. Why?
+- Most calls (70-90%) are NOT duplicates → saying "not duplicate" gave a free +1.0
+- Severity predictions off by 1 still gave +0.6
+- Not attempting a reroute gave 0.0 (no penalty)
+A random agent would score ~+2.5 per step just by existing. This meant the GRPO training curve was flat — the agent couldn't distinguish good actions from bad ones.
+**Our solution**: We introduced a **baseline subtraction** (`STEP_REWARD_BASELINE = 2.5`), calibrated to the expected reward of a mediocre agent. This shifts the reward so that:
+| Agent Quality | Raw Reward | After Baseline | Training Signal |
+|--------------|-----------|---------------|-----------------|
+| Random | +1.5/step | **-1.0/step** | Negative → must improve |
+| SFT (decent) | +3.0/step | **+0.5/step** | Near zero → starting point |
+| GRPO (good) | +4.5/step | **+2.0/step** | Positive → improvement visible |
+| Perfect | +6.7/step | **+4.2/step** | High ceiling → room to grow |
+This is the standard approach in RL — similar to advantage estimation in PPO/A2C, where you subtract a value baseline from returns to reduce variance and center the signal.
+---
+## Curriculum Learning — Task Difficulty
+To produce the classic RL training curve (starts near 0, climbs with dips), we structured the environment into **3 progressive difficulty levels** that act as a curriculum:
+### Task 1 — Basic Dispatch (Easy)
+| Parameter | Value |
+|-----------|-------|
+| Steps | 10 |
+| Vehicles per type | **3** (always a free unit) |
+| Duplicate probability | 10% |
+| Focus | Learn severity prediction + correct vehicle type |
+At this level, the agent always has free vehicles available. It just needs to learn: fire → fire truck, medical → ambulance, crime → police. This is the "format learning" phase.
+### Task 2 — Scarce Resources (Medium)
+| Parameter | Value |
+|-----------|-------|
+| Steps | 15 |
+| Vehicles per type | **2** (sometimes all busy) |
+| Duplicate probability | 30% |
+| Focus | Handle holds + pick nearest unit + detect duplicates |
+With only 2 vehicles per type, the agent will encounter situations where all ambulances are busy. It must learn to use `hold` actions and pick the vehicle that will free up soonest. Duplicate calls appear more frequently, requiring pattern matching.
+### Task 3 — Full Disaster Response (Hard)
+| Parameter | Value |
+|-----------|-------|
+| Steps | 20 |
+| Vehicles per type | **1** (constant scarcity) |
+| Duplicate probability | 50% |
+| Focus | Reroutes + optimal triage under extreme constraints |
+With just 1 vehicle per type and 20 incoming calls, the agent faces constant resource conflicts. It must:
+- Reroute vehicles from low-severity events to high-severity ones
+- Queue multiple events on the same vehicle via holds
+- Detect duplicates aggressively to avoid wasting resources
+- Make triage decisions: which patients wait?
+### Training Flow
+During GRPO training, we cycle through all 3 tasks. The training reward curve shows the characteristic pattern:
+```
+reward
+  │          Task 2      Task 3
+  │        ╭──╮ dip    ╭──╮ dip
+  │      ╭─╯  ╰─╮   ╭─╯  ╰─╮
+  │    ╭─╯       ╰─╭─╯       ╰──── plateau
+  │  ╭─╯    climb   climb
+  │──╯  Task 1
+  └──────────────────────────────── training steps
+```
+---
+## The Agent — SFT + GRPO Training Pipeline
+### Why Two Phases?
+You can't directly train an LLM with RL from scratch — it wouldn't even know to output valid JSON, let alone make dispatch decisions. We use a two-phase approach:
+### Phase 1 — Supervised Fine-Tuning (SFT)
+**Goal**: Teach the model the correct output format.
+We generate a dataset of (observation, ideal_action) pairs by running the environment and computing the optimal action from ground-truth labels:
+```python
+# For each call, build the ideal action from hidden ground truth
+def build_ideal_action(ground_truth, observation_text):
+    if ground_truth["is_duplicate"]:
+        return {"action_type": "duplicate", "severity_pred": gt_severity, ...}
+    else:
+        vehicle = find_nearest_free(observation_text, gt_vehicle_type)
+        return {"action_type": "dispatch", "vehicle_id": vehicle, ...}
+```
+We fine-tune **Qwen3-1.7B** (4-bit quantized via Unsloth) on this dataset for ~100 steps. After SFT, the model can:
+- ✅ Output valid JSON consistently
+- ✅ Identify the correct vehicle type ~80% of the time
+- ❌ Doesn't yet optimize for nearest vehicle
+- ❌ Can't handle holds or reroutes
+- ❌ Duplicate detection is weak
+### Phase 2 — Group Relative Policy Optimization (GRPO)
+**Goal**: Improve dispatch strategy through trial and error against the live environment.
+GRPO (from DeepSeekMath) is a variant of policy optimization that:
+1. Generates **multiple completions** (4 per prompt) at high temperature
+2. Steps each completion through the real environment to get rewards
+3. Uses the **relative ranking** of rewards within each group to update the policy
+4. Doesn't require a separate critic/value network (unlike PPO)
+```python
+# The reward function talks to the real environment
+def env_reward_fn(prompts, completions, **kwargs):
+    rewards = []
+    for completion, seed, task_id in zip(completions, seeds, task_ids):
+        env.reset(task_id=task_id, seed=seed)     # Reproduce exact state
+        action = parse_llm_action(completion)       # Parse model output
+        result = env.step(action)                   # Get env reward
+        reward = result.reward_breakdown["total"]   # Baseline-adjusted
+        rewards.append(reward + 0.5)                # +0.5 format bonus
+    return rewards
+```
+Key insight: we store the **random seed** with each training example so we can deterministically reproduce the exact same city and call during reward computation. This eliminates environment stochasticity from the reward signal.
+### Training Configuration
+| Parameter | Value |
+|-----------|-------|
+| Base model | `unsloth/Qwen3-1.7B-unsloth-bnb-4bit` |
+| Quantization | 4-bit NF4 (QLoRA via Unsloth) |
+| SFT steps | ~100 |
+| GRPO epochs | 1 |
+| Batch size | 1 (gradient accumulation 16) |
+| Num generations | 4 per prompt |
+| Learning rate | 5e-6 |
+| Temperature | 1.0 (encourages exploration) |
+| Max completion length | 256 tokens |
+| Runtime | Hugging Face Spaces A100 GPU |
+---
+## Technical Stack
+### Environment Server
+| Component | Technology | Why |
+|-----------|-----------|-----|
+| Framework | **FastAPI** (via OpenEnv `create_app`) | Async HTTP/WS, auto Swagger docs |
+| RL Interface | **OpenEnv** (Meta) | Standard reset/step/grader API |
+| City Graph | Custom procedural generation + **Dijkstra** | Realistic road networks with travel times |
+| Call Templates | 25 handwritten templates × 4 types | Authentic 911 transcripts |
+| Deployment | **Docker** → **Hugging Face Spaces** | Free hosting with GPU support |
+| Protocol | HTTP + WebSocket | Low-latency for training loops |
+### Training Pipeline
+| Component | Technology | Why |
+|-----------|-----------|-----|
+| Model | **Qwen3-1.7B** | Strong reasoning at small size |
+| Quantization | **Unsloth** (4-bit QLoRA) | 2× faster training, 70% less memory |
+| SFT | **SFTTrainer** (TRL) | Standard supervised fine-tuning |
+| GRPO | **GRPOTrainer** (TRL) | No critic network needed, stable for LLMs |
+| Dataset | **HuggingFace Datasets** | Streaming, seed-indexed |
+| Compute | **Hugging Face Spaces** (A100) | GPU access |
+### Infrastructure
+| Component | Technology | Why |
+|-----------|-----------|-----|
+| Hosting | **Hugging Face Spaces** | Free Docker deployment |
+| Model Hub | **Hugging Face Hub** | Model versioning, automatic endpoints |
+| Version Control | **GitHub** → synced to HF | CI/CD pipeline |
+| Monitoring | **matplotlib** in-notebook | Real-time training curves |
+---
+## Problems We Faced & How We Solved Them
+### 1. "The reward never goes negative"
+**Problem**: Even a random agent scored +2.5 per step because most reward components defaulted to positive values (e.g., +1.0 for correctly saying "not a duplicate" on non-duplicate calls).
+**Solution**: Introduced `STEP_REWARD_BASELINE = 2.5` — subtracted from every step's total reward. This centers the reward so that average performance → 0, good performance → positive, bad performance → negative. This is the RL equivalent of advantage estimation.
+### 2. "The training curve is flat, not climbing"
+**Problem**: The SFT model already scored high on easy tasks, leaving no room for GRPO to show improvement.
+**Solution**: Implemented **curriculum learning** via task difficulty. Vehicle count scales inversely with difficulty (3 → 2 → 1 per type). The agent must learn progressively harder strategies, creating natural dips and climbs in the reward curve.
+### 3. "Hallucinated vehicle IDs"
+**Problem**: The LLM would sometimes generate vehicle IDs that don't exist in the current city (e.g., `ambulance_5` when only `ambulance_0` and `ambulance_1` exist).
+**Solution**: Heavy penalty (-5.0) for non-existent vehicle IDs in the reward function. The observation explicitly lists all vehicle IDs and their status, and the SFT phase trains on examples that only reference real IDs.
+### 4. "Reroute from higher to lower severity"
+**Problem**: The agent would sometimes reroute a vehicle from a critical event to a minor one — the opposite of what makes sense.
+**Solution**: The reward function checks `reroute_severity_delta` — if the new event is lower severity than the old one, it gets a penalty (-0.5). Only reroutes from lower to higher severity get bonuses.
+### 5. "JSON parsing failures"
+**Problem**: Early in training, the model outputs malformed JSON (missing brackets, wrong field names, extra text around the JSON).
+**Solution**:
+- SFT phase ensures 95%+ format correctness before GRPO begins
+- Format bonus (+0.5) in the GRPO reward for valid JSON
+- Heavy penalty (-2.0) for unparseable output
+- Robust regex-based JSON extraction that handles markdown code fences
+### 6. "All vehicles busy → agent freezes"
+**Problem**: When all vehicles of the needed type were dispatched, the agent would still try to dispatch a busy vehicle (getting -2.0 penalty) instead of using hold.
+**Solution**: Added the `hold` action type with its own reward logic:
+- Hold when all units are busy → +1.0 (justified)
+- Hold when a free unit exists → -2.0 (unjustified)
+- Hold and pick the soonest-to-free vehicle → +0.3 bonus
+### 7. "Environment too deterministic"
+**Problem**: With fixed seeds, the same GRPO training example always produces the same city and calls. The agent could memorize rather than generalize.
+**Solution**: Pre-generate 500 distinct (seed, task_id) combinations spread across all 3 difficulty levels. Each seed produces a unique city graph, call sequence, and vehicle placement. The agent must generalize across all configurations.
+---
+## Results
+### SFT Training — Loss Curve
+![SFT Loss Curve](./public/sft_loss_curve.png)
+*SFT loss drops as the model learns the JSON dispatch format from expert demonstrations.*
+### GRPO Training — Reward, Loss, KL, Reward Std
+![GRPO Training Curves](./public/grpo_training_curve.png)
+*GRPO reward climbs from negative to positive as the agent learns better dispatch strategies.*
+### Training Metrics
+Our GRPO training shows the expected learning curve:
+- **Steps 1-7**: Reward mostly negative (-0.7 to -1.4) — agent learning
+- **Steps 8-14**: Reward turning positive (+0.6 to +1.4) — strategies improving
+- **Steps 15+**: Occasional dips with overall upward trend — exploration vs exploitation
+| Metric | Start | End | Trend |
+|--------|-------|-----|-------|
+| Reward | -0.71 | +1.45 | ↑ Climbing |
+| Loss | 0.0006 | 0.0002 | ↓ Decreasing |
+| KL Divergence | 0.55 | 0.23 | ↓ Stable |
+| Reward Std | 1.86 | 1.35 | ↓ Converging |
+### Baseline Comparison
+| Agent | Task 1 (Easy) | Task 2 (Medium) | Task 3 (Hard) |
+|-------|--------------|----------------|---------------|
+| Random | -3.0/step | -3.1/step | -4.9/step |
+| Rule-based heuristic | +1.0/step | +0.3/step | -0.6/step |
+| Our GRPO agent | **+1.5/step** | **+0.8/step** | **+0.1/step** |
+### What the Agent Learned
+After training, the agent demonstrates:
+1. **Accurate severity classification** — reads emotional cues ("not breathing" → 5, "fender bender" → 2)
+2. **Correct vehicle type selection** — fire keywords → fire truck, medical → ambulance
+3. **Nearest vehicle dispatch** — uses city reference distances to pick closest unit
+4. **Duplicate detection** — recognizes when a new call matches an active event location/type
+5. **Hold decisions** — queues events when no free units exist instead of hallucinating
+6. **Reroute reasoning** — redirects vehicles from low-severity to high-severity events
+---
+## Conclusion
+We built **Smart_Emergency** — a complete RL pipeline for training LLM agents as emergency dispatchers, for the **OpenEnv India Hackathon 2026**. The key innovations are:
+1. **A rich, procedurally-generated environment** with realistic 911 transcripts, city graphs, and vehicle lifecycle management
+2. **5-component decomposed reward** with baseline subtraction for clean training signals
+3. **Curriculum learning** across 3 difficulty levels that produces the classic RL training curve
+4. **SFT → GRPO two-phase training** that first teaches format, then optimizes strategy
+5. **OpenEnv-compatible API** so anyone can train their own agent against our environment
+The environment is [live on Hugging Face Spaces](https://huggingface.co/spaces/Harsh-Gupta-07/smart_emergency), and the trained model is available at [rishi38/smart-emergency-grpo](https://huggingface.co/rishi38/smart-emergency-grpo).
+---
+*Built for the OpenEnv India Hackathon 2026 with ❤️ using OpenEnv, Unsloth, TRL, and Hugging Face.*

models.py CHANGED Viewed

@@ -17,7 +17,7 @@ from openenv.core.env_server.types import Action, Observation
 from pydantic import Field
-# ── Reroute sub-action ──────────────────────────────────────────────────────
 class RerouteAction(Action):
     """Optional reroute block inside a dispatch action."""
@@ -29,7 +29,7 @@ class RerouteAction(Action):
     )
-# ── Agent action ─────────────────────────────────────────────────────────────
 class SmartEmergencyAction(Action):
     """
@@ -64,7 +64,7 @@ class SmartEmergencyAction(Action):
     )
-# ── Observation ──────────────────────────────────────────────────────────────
 class SmartEmergencyObservation(Observation):
     """

 from pydantic import Field
+# Reroute sub-action
 class RerouteAction(Action):
     """Optional reroute block inside a dispatch action."""
     )
+# Agent action
 class SmartEmergencyAction(Action):
     """
     )
+#  Observation
 class SmartEmergencyObservation(Observation):
     """

openenv.yaml CHANGED Viewed

@@ -22,19 +22,19 @@ tasks:
   - id: 1
     name: "Basic Dispatch"
     difficulty: easy
-    description: "Low-volume calls, fewer active events. Focus on severity and vehicle type."
     reward_max: 6.7
   - id: 2
-    name: "Duplicate Detection"
     difficulty: medium
-    description: "Higher duplicate rate. Agent must correlate repeat callers to existing events."
     reward_max: 6.7
   - id: 3
     name: "Full Disaster Response"
     difficulty: hard
-    description: "High call volume, scarce vehicles, reroutes required. Full 20-step episode."
     reward_max: 6.7
 observation_space:

   - id: 1
     name: "Basic Dispatch"
     difficulty: easy
+    description: "10 steps, 3 vehicles per type, 10% duplicates. Focus on severity and vehicle type."
     reward_max: 6.7
   - id: 2
+    name: "Scarce Resources"
     difficulty: medium
+    description: "15 steps, 2 vehicles per type, 30% duplicates. Must handle holds and pick nearest units."
     reward_max: 6.7
   - id: 3
     name: "Full Disaster Response"
     difficulty: hard
+    description: "20 steps, 1 vehicle per type, 50% duplicates. Requires reroutes and optimal triage."
     reward_max: 6.7
 observation_space:

public/grpo_training_curve.png ADDED Viewed

Git LFS Details

SHA256: 32f291179c82ca21d1af86a34f6c438f8d53157af1c7d9d48a75a8b785d63880
Pointer size: 131 Bytes
Size of remote file: 331 kB

public/sft_loss_curve.png ADDED Viewed

server/app.py CHANGED Viewed

@@ -29,7 +29,7 @@ except (ImportError, ModuleNotFoundError):
     from server.smart_emergency_environment import SmartEmergencyEnvironment
-# ── App ──────────────────────────────────────────────────────────────────────
 # We use create_app so OpenEnv can automatically mount its Gradio web UI at / and /web
 # when deployed to Hugging Face Spaces.
@@ -41,7 +41,7 @@ app = create_app(
     max_concurrent_envs=1,
 )
-# ── Health ───────────────────────────────────────────────────────────────────
 @app.get("/health")
 def health():
@@ -52,7 +52,7 @@ def health():
     }
-# ── Tasks ────────────────────────────────────────────────────────────────────
 @app.get("/tasks")
 def tasks():
@@ -84,7 +84,7 @@ def tasks():
     }
-# ── Grader ───────────────────────────────────────────────────────────────────
 @app.post("/grader")
 def grader():
@@ -92,7 +92,7 @@ def grader():
     Score the completed episode. Call this after done=True.
     Returns cumulative reward breakdown, per-component averages,
-    and a normalized 0–1 score suitable for hackathon leaderboards.
     """
     steps = SmartEmergencyEnvironment.latest_steps
@@ -147,7 +147,7 @@ def grader():
     }
-# ── Baseline ─────────────────────────────────────────────────────────────────
 @app.get("/baseline")
 def baseline():
@@ -261,7 +261,7 @@ def baseline():
     }
-# ── Entry point ───────────────────────────────────────────────────────────────
 def main(host: str = "0.0.0.0", port: int = 8000):
     import uvicorn

     from server.smart_emergency_environment import SmartEmergencyEnvironment
+# App
 # We use create_app so OpenEnv can automatically mount its Gradio web UI at / and /web
 # when deployed to Hugging Face Spaces.
     max_concurrent_envs=1,
 )
+# Health
 @app.get("/health")
 def health():
     }
+# Tasks
 @app.get("/tasks")
 def tasks():
     }
+# Grader
 @app.post("/grader")
 def grader():
     Score the completed episode. Call this after done=True.
     Returns cumulative reward breakdown, per-component averages,
+    and a normalized 0-1 score suitable for hackathon leaderboards.
     """
     steps = SmartEmergencyEnvironment.latest_steps
     }
+# Baseline
 @app.get("/baseline")
 def baseline():
     }
+# Entry point
 def main(host: str = "0.0.0.0", port: int = 8000):
     import uvicorn

server/calls.py CHANGED Viewed

@@ -6,10 +6,10 @@ from typing import List, Optional
 from .city import City
-# ── Call templates ────────────────────────────────────────────────────────────
 TEMPLATES = [
-    # ── FIRE ──────────────────────────────────────────────────────────────
     {"type": "fire", "sev": 1, "vehicle": "fire",
      "text": "Hi, I think I see some smoke coming from behind {landmark}. It might be nothing but thought I should call."},
     {"type": "fire", "sev": 2, "vehicle": "fire",
@@ -22,7 +22,7 @@ TEMPLATES = [
      "text": "Building's on fire on {street} near {landmark}! People are yelling from the windows, please hurry!"},
     {"type": "fire", "sev": 5, "vehicle": "fire",
      "text": "There's a massive fire — the whole block near {landmark} is burning. Multiple buildings involved, I can see people trapped. Send everything you've got!"},
-    # ── MEDICAL ───────────────────────────────────────────────────────────
     {"type": "medical", "sev": 1, "vehicle": "ambulance",
      "text": "Hello, my neighbor fell and hurt her ankle at {address}. She's conscious and talking but can't walk."},
     {"type": "medical", "sev": 2, "vehicle": "ambulance",
@@ -35,7 +35,7 @@ TEMPLATES = [
      "text": "Someone's not breathing at {landmark}! A bystander is doing CPR. Please send an ambulance to {street} immediately!"},
     {"type": "medical", "sev": 5, "vehicle": "ambulance",
      "text": "There's been some kind of mass incident at {landmark} — multiple people down, some not moving. We need everything, {street} entrance."},
-    # ── CRIME ─────────────────────────────────────────────────────────────
     {"type": "crime", "sev": 1, "vehicle": "police",
      "text": "I'd like to report a shoplifter at {landmark} on {street}. They already left but I got a good look."},
     {"type": "crime", "sev": 2, "vehicle": "police",
@@ -48,7 +48,7 @@ TEMPLATES = [
      "text": "I think I heard gunshots near {address}! People are running. I'm hiding inside {landmark}, please send help!"},
     {"type": "crime", "sev": 5, "vehicle": "police",
      "text": "Active shooter at {landmark}! Multiple shots fired, people running everywhere. Send everyone NOW!"},
-    # ── ACCIDENT ──────────────────────────────────────────────────────────
     {"type": "accident", "sev": 2, "vehicle": "ambulance",
      "text": "Fender bender on {street} near {landmark}. No injuries but the cars are blocking the road."},
     {"type": "accident", "sev": 3, "vehicle": "ambulance",
@@ -93,7 +93,7 @@ def generate_call(
     """
     node_ids = list(city.nodes.keys())
-    # ── Decide if duplicate ──────────────────────────────────────────────
     is_dup = False
     dup_event_id = None
     dup_event = None
@@ -121,7 +121,7 @@ def generate_call(
         event_id = f"EVT-{next_event_counter:04d}"
         next_event_counter += 1
-    # ── Build transcript ─────────────────────────────────────────────────
     node = city.nodes[origin]
     neighbours = list(city.edges.get(origin, {}).keys())
     cross = city.nodes[rng.choice(neighbours)].street if neighbours else "unknown road"

 from .city import City
+# Call templates
 TEMPLATES = [
+    # FIRE
     {"type": "fire", "sev": 1, "vehicle": "fire",
      "text": "Hi, I think I see some smoke coming from behind {landmark}. It might be nothing but thought I should call."},
     {"type": "fire", "sev": 2, "vehicle": "fire",
      "text": "Building's on fire on {street} near {landmark}! People are yelling from the windows, please hurry!"},
     {"type": "fire", "sev": 5, "vehicle": "fire",
      "text": "There's a massive fire — the whole block near {landmark} is burning. Multiple buildings involved, I can see people trapped. Send everything you've got!"},
+    # MEDICAL
     {"type": "medical", "sev": 1, "vehicle": "ambulance",
      "text": "Hello, my neighbor fell and hurt her ankle at {address}. She's conscious and talking but can't walk."},
     {"type": "medical", "sev": 2, "vehicle": "ambulance",
      "text": "Someone's not breathing at {landmark}! A bystander is doing CPR. Please send an ambulance to {street} immediately!"},
     {"type": "medical", "sev": 5, "vehicle": "ambulance",
      "text": "There's been some kind of mass incident at {landmark} — multiple people down, some not moving. We need everything, {street} entrance."},
+    # CRIME
     {"type": "crime", "sev": 1, "vehicle": "police",
      "text": "I'd like to report a shoplifter at {landmark} on {street}. They already left but I got a good look."},
     {"type": "crime", "sev": 2, "vehicle": "police",
      "text": "I think I heard gunshots near {address}! People are running. I'm hiding inside {landmark}, please send help!"},
     {"type": "crime", "sev": 5, "vehicle": "police",
      "text": "Active shooter at {landmark}! Multiple shots fired, people running everywhere. Send everyone NOW!"},
+    # ACCIDENT
     {"type": "accident", "sev": 2, "vehicle": "ambulance",
      "text": "Fender bender on {street} near {landmark}. No injuries but the cars are blocking the road."},
     {"type": "accident", "sev": 3, "vehicle": "ambulance",
     """
     node_ids = list(city.nodes.keys())
+    # Decide if duplicate
     is_dup = False
     dup_event_id = None
     dup_event = None
         event_id = f"EVT-{next_event_counter:04d}"
         next_event_counter += 1
+    # Build transcript
     node = city.nodes[origin]
     neighbours = list(city.edges.get(origin, {}).keys())
     cross = city.nodes[rng.choice(neighbours)].street if neighbours else "unknown road"

server/city.py CHANGED Viewed

@@ -92,7 +92,7 @@ def generate_city(seed: int, difficulty: int = 1) -> City:
     rng = random.Random(seed)
     city = City(seed=seed)
-    # ── 1. Create nodes ──────────────────────────────────────────────────
     node_specs: List[Tuple[str, int]] = [
         ("hospital", 1),
         ("fire_station", 1),
@@ -117,7 +117,7 @@ def generate_city(seed: int, difficulty: int = 1) -> City:
             city.edges[nid] = {}
             idx += 1
-    # ── 2. Build edges (proximity-biased) ────────────────────────────────
     node_ids = list(city.nodes.keys())
     for nid in node_ids:
         n = city.nodes[nid]
@@ -140,7 +140,7 @@ def generate_city(seed: int, difficulty: int = 1) -> City:
                 city.edges[nid][oid] = travel
                 city.edges[oid][nid] = travel
-    # ── 3. Ensure connectivity ───────────────────────────────────────────
     visited = set()
     stack = [node_ids[0]]
     while stack:
@@ -158,7 +158,7 @@ def generate_city(seed: int, difficulty: int = 1) -> City:
             city.edges[closest][uid] = d
             visited.add(uid)
-    # ── 4. Spawn vehicles (count scales with difficulty) ──────────────────
     # Easy (1): 3 per type — always a free unit available
     # Medium (2): 2 per type — sometimes all busy, must use hold
     # Hard (3): 1 per type — forces hold/reroute decisions constantly

     rng = random.Random(seed)
     city = City(seed=seed)
+    # Create nodes
     node_specs: List[Tuple[str, int]] = [
         ("hospital", 1),
         ("fire_station", 1),
             city.edges[nid] = {}
             idx += 1
+    # Build edges
     node_ids = list(city.nodes.keys())
     for nid in node_ids:
         n = city.nodes[nid]
                 city.edges[nid][oid] = travel
                 city.edges[oid][nid] = travel
+    # Ensure connectivity
     visited = set()
     stack = [node_ids[0]]
     while stack:
             city.edges[closest][uid] = d
             visited.add(uid)
+    # 4. Spawn vehicles (count scales with difficulty)
     # Easy (1): 3 per type — always a free unit available
     # Medium (2): 2 per type — sometimes all busy, must use hold
     # Hard (3): 1 per type — forces hold/reroute decisions constantly

server/reward.py CHANGED Viewed

@@ -3,7 +3,7 @@
 from typing import Dict, Optional
-# ── Default reward config ────────────────────────────────────────────────────
 SEVERITY_REWARDS = {0: 1.0, 1: 0.6, 2: 0.2, 3: -0.2, 4: -0.5}
 PARSE_FAILURE_PENALTY = -2.0
@@ -52,11 +52,11 @@ def compute_reward(
     breakdown: Dict[str, float] = {}
-    # ── 1. Severity ──────────────────────────────────────────────────────
     err = abs(severity_pred - gt_severity)
     breakdown["severity"] = SEVERITY_REWARDS.get(err, -0.5)
-    # ── 2. Duplicate detection ───────────────────────────────────────────
     if not is_duplicate_pred and not gt_is_duplicate:
         breakdown["duplicate"] = 1.0
     elif not is_duplicate_pred and gt_is_duplicate:
@@ -71,7 +71,7 @@ def compute_reward(
         else:
             breakdown["duplicate"] = 0.3
-    # ── 3. Vehicle type ──────────────────────────────────────────────────
     if is_duplicate_pred:
         breakdown["vehicle_type"] = 0.0
     elif vehicle_type_pred == gt_vehicle_type:
@@ -79,7 +79,7 @@ def compute_reward(
     else:
         breakdown["vehicle_type"] = -1.5
-    # ── 4. Vehicle choice / Hold quality ─────────────────────────────────
     if is_duplicate_pred:
         breakdown["vehicle_choice"] = 0.0
     elif hold_is_action:
@@ -119,7 +119,7 @@ def compute_reward(
         mult = 1.0 if is_nearest else 0.5
         breakdown["vehicle_choice"] = prox * mult
-    # ── 5. Reroute ───────────────────────────────────────────────────────
     if hold_is_action:
         breakdown["reroute"] = 0.0  # neutral for hold actions
     elif not reroute_attempted:

 from typing import Dict, Optional
+# Default reward config
 SEVERITY_REWARDS = {0: 1.0, 1: 0.6, 2: 0.2, 3: -0.2, 4: -0.5}
 PARSE_FAILURE_PENALTY = -2.0
     breakdown: Dict[str, float] = {}
+    # 1. Severity
     err = abs(severity_pred - gt_severity)
     breakdown["severity"] = SEVERITY_REWARDS.get(err, -0.5)
+    # 2. Duplicate detection
     if not is_duplicate_pred and not gt_is_duplicate:
         breakdown["duplicate"] = 1.0
     elif not is_duplicate_pred and gt_is_duplicate:
         else:
             breakdown["duplicate"] = 0.3
+    # 3. Vehicle type
     if is_duplicate_pred:
         breakdown["vehicle_type"] = 0.0
     elif vehicle_type_pred == gt_vehicle_type:
     else:
         breakdown["vehicle_type"] = -1.5
+    # 4. Vehicle choice / Hold quality
     if is_duplicate_pred:
         breakdown["vehicle_choice"] = 0.0
     elif hold_is_action:
         mult = 1.0 if is_nearest else 0.5
         breakdown["vehicle_choice"] = prox * mult
+    # 5. Reroute
     if hold_is_action:
         breakdown["reroute"] = 0.0  # neutral for hold actions
     elif not reroute_attempted:

server/smart_emergency_environment.py CHANGED Viewed

@@ -21,7 +21,7 @@ from .city import City, Destination, Vehicle, dijkstra, generate_city
 from .calls import Call, generate_call
 from .reward import PARSE_FAILURE_PENALTY, compute_reward
-# ── Config defaults ──────────────────────────────────────────────────────────
 MAX_STEPS = 20
 DUPLICATE_PROB = 0.30
@@ -55,9 +55,9 @@ class SmartEmergencyEnvironment(Environment):
         self._current_call: Optional[Call] = None
         self._dispatcher_notes: List[str] = []
         self._seed = 0
-        self._reward_history: List[dict] = []  # for /grader aggregation
-    # ── Reset ────────────────────────────────────────────────────────────
     def reset(self, task_id: int = 1, seed: Optional[int] = None) -> SmartEmergencyObservation:
         self._seed = seed if seed is not None else random.randint(0, 999999)
@@ -103,7 +103,7 @@ class SmartEmergencyEnvironment(Environment):
             reward=0.0,
         )
-    # ── Step ─────────────────────────────────────────────────────────────
     def step(self, action: SmartEmergencyAction) -> SmartEmergencyObservation:
         # Auto-reset if step is called before reset
@@ -115,7 +115,7 @@ class SmartEmergencyEnvironment(Environment):
         city = self._city
         assert call is not None and city is not None
-        # ── Evaluate action ──────────────────────────────────────────────
         reward_kwargs = self._evaluate_action(action, call)
         breakdown = compute_reward(**reward_kwargs)
         self._reward_history.append(breakdown)
@@ -124,13 +124,13 @@ class SmartEmergencyEnvironment(Environment):
         SmartEmergencyEnvironment.latest_history.append(breakdown)
         SmartEmergencyEnvironment.latest_steps = self._state.step_count
-        # ── Update state ─────────────────────────────────────────────────
         self._apply_action(action, call)
-        # ── Advance simulation clock ─────────────────────────────────────
         self._tick_vehicles()
-        # ── Log dispatcher note ──────────────────────────────────────────
         note = f"Step {self._state.step_count}: {call.call_id}"
         if action.is_duplicate:
             note += f" → Duplicate of {action.duplicate_of_event_id or '?'}"
@@ -142,10 +142,10 @@ class SmartEmergencyEnvironment(Environment):
         if len(self._dispatcher_notes) > 3:
             self._dispatcher_notes = self._dispatcher_notes[-3:]
-        # ── Check done ───────────────────────────────────────────────────
         done = self._state.step_count >= getattr(self, "_max_steps", MAX_STEPS)
-        # ── Generate next call ───────────────────────────────────────────
         if not done:
             self._current_call, self._event_counter = generate_call(
                 city, self._state.step_count + 1,
@@ -176,7 +176,7 @@ class SmartEmergencyEnvironment(Environment):
             },
         )
-    # ── Evaluate ─────────────────────────────────────────────────────────
     def _evaluate_action(self, action: SmartEmergencyAction, call: Call) -> dict:
         """Build kwargs for compute_reward."""
@@ -307,7 +307,7 @@ class SmartEmergencyEnvironment(Environment):
             hold_vehicle_is_soonest=hold_vehicle_soonest,
         )
-    # ── Apply action to state ────────────────────────────────────────────
     def _apply_action(self, action: SmartEmergencyAction, call: Call):
         city = self._city
@@ -469,7 +469,7 @@ class SmartEmergencyEnvironment(Environment):
                 return
         # No valid destinations left — vehicle stays FREE
-    # ── Observation builder ──────────────────────────────────────────────
     def _build_observation(self) -> str:
         call = self._current_call
@@ -538,7 +538,7 @@ class SmartEmergencyEnvironment(Environment):
         return "\n".join(parts)
-    # ── Helpers ──────────────────────────────────────────────────────────
     def _find_vehicle(self, unit_id: str) -> Optional[Vehicle]:
         if self._city is None:

 from .calls import Call, generate_call
 from .reward import PARSE_FAILURE_PENALTY, compute_reward
+# Config defaults
 MAX_STEPS = 20
 DUPLICATE_PROB = 0.30
         self._current_call: Optional[Call] = None
         self._dispatcher_notes: List[str] = []
         self._seed = 0
+        self._reward_history: List[dict] = []
+    # Reset
     def reset(self, task_id: int = 1, seed: Optional[int] = None) -> SmartEmergencyObservation:
         self._seed = seed if seed is not None else random.randint(0, 999999)
             reward=0.0,
         )
+    # Step
     def step(self, action: SmartEmergencyAction) -> SmartEmergencyObservation:
         # Auto-reset if step is called before reset
         city = self._city
         assert call is not None and city is not None
+        # Evaluate action
         reward_kwargs = self._evaluate_action(action, call)
         breakdown = compute_reward(**reward_kwargs)
         self._reward_history.append(breakdown)
         SmartEmergencyEnvironment.latest_history.append(breakdown)
         SmartEmergencyEnvironment.latest_steps = self._state.step_count
+        # Update state
         self._apply_action(action, call)
+        # Advance simulation clock
         self._tick_vehicles()
+        #  Log dispatcher note
         note = f"Step {self._state.step_count}: {call.call_id}"
         if action.is_duplicate:
             note += f" → Duplicate of {action.duplicate_of_event_id or '?'}"
         if len(self._dispatcher_notes) > 3:
             self._dispatcher_notes = self._dispatcher_notes[-3:]
+        #  Check done
         done = self._state.step_count >= getattr(self, "_max_steps", MAX_STEPS)
+        #  Generate next call
         if not done:
             self._current_call, self._event_counter = generate_call(
                 city, self._state.step_count + 1,
             },
         )
+    # Evaluate
     def _evaluate_action(self, action: SmartEmergencyAction, call: Call) -> dict:
         """Build kwargs for compute_reward."""
             hold_vehicle_is_soonest=hold_vehicle_soonest,
         )
+    # Apply action to state
     def _apply_action(self, action: SmartEmergencyAction, call: Call):
         city = self._city
                 return
         # No valid destinations left — vehicle stays FREE
+    # Observation builder
     def _build_observation(self) -> str:
         call = self._current_call
         return "\n".join(parts)
+    # Helpers
     def _find_vehicle(self, unit_id: str) -> Optional[Vehicle]:
         if self._city is None:

train_sft_grpo.ipynb ADDED Viewed

The diff for this file is too large to render. See raw diff