roshan5emerald committed
Commit 9155bc6 · verified · 1 parent: cf4b402

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +432 -432
  2. server/app.py +12 -1
README.md CHANGED
---
title: LogiFlow-RL
emoji: "⭐"
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
base_path: /web
---

# LogiFlow-RL — Smart Supply Chain Crisis Management

> **Training an LLM to route shipments proactively across a 12-node global supply chain — before disruptions cascade, not after.**

[![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)](https://github.com/meta-pytorch/OpenEnv)
[![HuggingFace Space](https://img.shields.io/badge/🤗%20Space-Live-green)](https://huggingface.co/spaces/<your-space-url>)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Roshan5105labs/crisis-logistics-env/blob/main/notebooks/logiflow_grpo_colab.ipynb)
[![Theme](https://img.shields.io/badge/Theme-Long--Horizon%20Planning-purple)](https://github.com/meta-pytorch/OpenEnv)

---

## The Problem

The 2021 Suez Canal blockage held up **$9 billion of goods every day**. Every major retailer missed
delivery SLAs that quarter. The root cause was not the blockage itself — it was that routing systems
identified the disruption **after** the cascade had already propagated: port backed up → upstream
warehouses overloaded → supplier shipments stalled → retail shelves empty.

Modern logistics software is fundamentally **reactive**. It alerts managers when things have already
gone wrong. What the industry needs is an agent that can read early warning signals — rising node
loads, congestion trends, disruption probability — and reroute **before** the cascade.

This is what LogiFlow-RL trains.

---

## What This Project Does

LogiFlow-RL is an **OpenEnv-compliant reinforcement learning environment** that simulates a
12-node global supply chain operating under stochastic disruptions. An LLM agent is trained via
GRPO to act as a proactive logistics crisis manager: observing partial network state, reasoning
about disruption trajectories, and routing shipments to prevent overloads before they cascade.

```
Suppliers → Warehouses → Distribution Centres → Retail Sinks
    4     →     3      →          3           →      2
```

The environment is **genuinely hard to solve** — a round-robin baseline scores only 0.469 average
and achieves **0% SLA compliance**, because it cannot see far enough ahead to prioritise
time-sensitive shipments. Even a well-designed heuristic struggles on cascade scenarios.

---

## The Capability Gap Being Targeted

| What LLMs are good at today | What this environment trains |
|---|---|
| One-shot Q&A and summarisation | Multi-step sequential decisions |
| Full information, short context | Partial observability, long horizon |
| Static prompts | Dynamic world state that changes every step |
| Reactive reasoning | Anticipatory planning under uncertainty |

> **Research framing:** Could a researcher write a paper about training on this environment?
> Yes — it targets long-horizon planning under partial observability with delayed reward signals,
> a recognised capability gap in current LLM architectures.

---

## Environment Architecture

### Network Topology (12 nodes, 4 tiers)

```
[Node 0] Supplier North ──┐
[Node 1] Supplier West  ──┼──► [Node 4] Warehouse Alpha ──► [Node 7] DC Metro   ──► [Node 10] Retail North
[Node 2] Supplier Port  ──┼──► [Node 5] Warehouse Beta  ──► [Node 8] DC Central ──►          ↕
[Node 3] Supplier Inland──┘──► [Node 6] Warehouse Gamma ──► [Node 9] DC Coastal ──► [Node 11] Retail South
```

Every node has: **capacity**, **current load**, **drain rate**, **risk score**, and
**typed connections** to downstream nodes. Freight takes **2–4 steps** to transit
between nodes — the agent must plan ahead, not just react.

### What Makes This Hard

**1. Partial observability.** The agent sees only nodes within 2 hops of the current
shipment source. Nodes beyond that radius appear as `null` in the observation. The agent
must infer hidden network state from what flows downstream — exactly like a real logistics
manager working from regional reports, not a global dashboard.
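
As a concrete illustration, here is a minimal sketch of two-hop masking. The `NODE_EDGES` adjacency and the helper names are illustrative, not the environment's actual internals:

```python
from collections import deque

# Hypothetical adjacency list for the 12-node topology shown above.
NODE_EDGES = {
    0: [4], 1: [4, 5], 2: [5, 6], 3: [6],
    4: [7], 5: [8], 6: [9],
    7: [10], 8: [10, 11], 9: [11],
    10: [], 11: [],
}

def visible_nodes(source: int, hops: int = 2) -> set[int]:
    """Breadth-first walk outward from the shipment source, up to `hops` edges."""
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nxt in NODE_EDGES[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

def mask_loads(loads: list[float], visible: set[int]) -> list[float | None]:
    """Hidden nodes are reported as None (serialised as null)."""
    return [load if i in visible else None for i, load in enumerate(loads)]
```

On this adjacency, `visible_nodes(2)` yields `{2, 5, 6, 8, 9}`, matching the `visible_node_ids` shown in the Observation Space example below.
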
**2. Stochastic cascade disruptions.** Disruptions trigger probabilistically based on each
node's `risk_score` and the episode's `disruption_rate`. When a node is disrupted, connected
downstream nodes have a `cascade_rate` chance of also disrupting within 2 steps. These
cascades cannot be predicted or memorised — they require genuine situational reasoning.

**3. Priority-demand windows.** Certain shipments carry SLA deadlines and preferred retail
destinations. Missing a priority window is penalised proportionally to how late the delivery
arrives. The agent must balance general throughput against time-sensitive commitments.

**4. Dynamic pressure feedback.** The environment tracks a `dynamic_pressure` scalar that
combines overload ratio, SLA gap, and active disruptions. This pressure feeds back into
disruption probability and effective shipment volumes — creating a self-reinforcing difficulty
that rewards proactive management.
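
A minimal sketch of how these mechanics could interact, with illustrative field names and weightings (the real engine lives in `server/crisis_logistics_env_environment.py`):

```python
import random
from dataclasses import dataclass

@dataclass
class Node:
    risk_score: float         # static fragility, 0..1
    load: float
    capacity: float
    disrupted_steps: int = 0  # > 0 while a disruption is active

def step_disruptions(nodes, edges, disruption_rate, cascade_rate, pressure):
    """Illustrative disruption step: probabilistic triggers, then downstream cascades."""
    for node in nodes:
        trigger_p = disruption_rate * node.risk_score * (1.0 + pressure)
        if node.disrupted_steps == 0 and random.random() < trigger_p:
            node.disrupted_steps = random.randint(2, 4)
    for i, node in enumerate(nodes):
        if node.disrupted_steps > 0:
            for j in edges.get(i, []):                    # connected downstream nodes
                if nodes[j].disrupted_steps == 0 and random.random() < cascade_rate:
                    nodes[j].disrupted_steps = 2          # cascades land within 2 steps

def dynamic_pressure(nodes, sla_gap: float) -> float:
    """Fold overload ratio, SLA gap, and active disruptions into one 0..1 scalar.
    The 0.5 / 0.3 / 0.2 weights are assumptions for illustration."""
    overload = sum(max(0.0, n.load / n.capacity - 1.0) for n in nodes) / len(nodes)
    active = sum(n.disrupted_steps > 0 for n in nodes) / len(nodes)
    return min(1.0, 0.5 * overload + 0.3 * sla_gap + 0.2 * active)
```
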
### Three Difficulty Tiers

| Task | Title | Steps | Disruption Rate | Cascade Rate | Objective |
|------|-------|-------|-----------------|--------------|-----------|
| **Easy** | Regional Network Balancing | 50 | 0.05 | 0.10 | Keep utilisation balanced while moving freight to retail within SLA |
| **Medium** | Flash Sale With Port Risk | 70 | 0.09 | 0.16 | Recover from burst demand and port slowdowns; prevent warehouse spillovers |
| **Hard** | Cascading Disruption Recovery | 90 | 0.12 | 0.22 | Stabilise a partially observable chain through weather events, supplier failures, and cascade disruptions |

### Action Space

At each step the agent receives a natural language observation and must output:

```json
{
  "reasoning": "Port 2 is trending toward congestion. Warehouse Beta has 33% buffer capacity. Routing via Beta avoids the likely cascade.",
  "source_node": 2,
  "dest_node": 5,
  "shipment_volume": 18.5
}
```

The `reasoning` field is not just cosmetic — it is **required** by the reward function and
is what judges and users actually see when demonstrating the trained model.
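
A minimal sketch of parsing and legality-checking such an action; the helper names are illustrative, and `edges` stands in for the environment's connection map:

```python
import json

REQUIRED_KEYS = {"reasoning", "source_node", "dest_node", "shipment_volume"}

def parse_action(text: str) -> dict | None:
    """Parse the model's JSON output; None signals a malformed or incomplete action."""
    try:
        action = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict) or not REQUIRED_KEYS <= action.keys():
        return None
    return action

def is_legal(action: dict, edges: dict[int, list[int]], pending_source: int) -> bool:
    """Legal actions ship from the pending source to a directly connected node."""
    return (action["source_node"] == pending_source
            and action["dest_node"] in edges.get(action["source_node"], [])
            and 0 < action["shipment_volume"] <= 60)
```
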
### Observation Space

```python
CrisisLogisticsObservation(
    step_count           = 14,
    max_steps            = 90,
    visible_node_ids     = [2, 5, 6, 8, 9],                           # 2-hop visibility only
    observed_node_loads  = [67.3, None, 44.1, None, 51.7, None, ...],  # null = hidden
    node_capacities      = [90.0, None, 125.0, ...],
    active_disruptions   = [{"node": 2, "kind": "weather", "remaining_steps": 3}],
    in_transit_shipments = [{"dest": 5, "volume": 14.2, "remaining_steps": 2}],
    pending_source_node  = 2,
    incoming_load        = 21.5,
    dynamic_pressure     = 0.38,
    cumulative_score     = 0.61,
    last_reward          = 0.72,
)
```

---

## Reward Design

The environment uses a **7-component weighted grader** to prevent reward hacking and
ensure every aspect of logistics performance is measured independently.

### Episode Grader (`graders.py`)

| Component | Weight | What It Measures |
|-----------|--------|------------------|
| Bottleneck avoidance | 18% | How often any node exceeded capacity |
| Network balance | 18% | Average load-gap between most and least loaded nodes |
| Step reward | 14% | Average per-step reward across the episode |
| Retail delivery | 20% | Freight actually delivered to retail nodes vs target |
| SLA compliance | 15% | Deliveries arriving within their deadline window |
| Disruption recovery | 10% | How quickly the network stabilised after each disruption |
| Action validity | 5% | Fraction of legal (connected) routing decisions |
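
Schematically, the final grade is a weighted sum of the seven component scores. A sketch, assuming each component is already normalised to 0.0–1.0:

```python
# Weights taken from the table above; key names are illustrative.
GRADER_WEIGHTS = {
    "bottleneck_avoidance": 0.18,
    "network_balance":      0.18,
    "step_reward":          0.14,
    "retail_delivery":      0.20,
    "sla_compliance":       0.15,
    "disruption_recovery":  0.10,
    "action_validity":      0.05,
}

def episode_grade(components: dict[str, float]) -> float:
    """Weighted sum, clamped so the episode grade stays in 0.0-1.0."""
    score = sum(GRADER_WEIGHTS[k] * components[k] for k in GRADER_WEIGHTS)
    return max(0.0, min(1.0, score))
```
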
### Training Reward (`action_reward` in `train_grpo.py`)

The GRPO training reward is a 5-component verifiable reward:

| Component | Max | What It Checks |
|-----------|-----|----------------|
| Valid JSON | 0.20 | Output is parseable JSON |
| Required keys | 0.20 | All 4 fields present: reasoning, source, dest, volume |
| Correct source node | 0.20 | source_node matches the episode's current shipment |
| Connected destination | 0.25 | dest_node is a legal neighbour of source_node |
| Plausible volume | 0.15 | 0 < shipment_volume ≤ 60 and close to incoming load |
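
A sketch of how these five checks could compose. The 25% volume tolerance is an assumption; the table only says the volume must be close to the incoming load:

```python
import json

def action_reward(completion: str, pending_source: int, incoming: float,
                  edges: dict[int, list[int]]) -> float:
    """Illustrative 5-component verifiable reward; component sizes match the table."""
    try:
        action = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(action, dict):
        return 0.0
    reward = 0.20                               # valid JSON
    keys = {"reasoning", "source_node", "dest_node", "shipment_volume"}
    if not keys <= action.keys():
        return reward
    reward += 0.20                              # all four required keys present
    if action["source_node"] == pending_source:
        reward += 0.20                          # matches the episode's current shipment
    if action["dest_node"] in edges.get(action["source_node"], []):
        reward += 0.25                          # legal neighbour of the source
    vol = action["shipment_volume"]
    if 0 < vol <= 60 and abs(vol - incoming) <= 0.25 * max(incoming, 1.0):
        reward += 0.15                          # plausible, close to the incoming load
    return reward
```
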
### Anti-Gaming Guards

- Reward only counts on **confirmed delivery**, not on dispatch
- **Route-repeat penalty** for consecutive identical routing decisions (see the sketch below)
- **Risk penalty** for routing through actively disrupted nodes
- **Overload penalty** applied even if JSON format is perfect
- All reward components are **independent** — gaming one does not inflate others
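
For instance, the route-repeat guard could look like this minimal sketch (the penalty size is an assumption):

```python
def route_repeat_penalty(history: list[tuple[int, int]], penalty: float = 0.1) -> float:
    """History holds (source, dest) pairs; repeating the last route costs reward."""
    if len(history) >= 2 and history[-1] == history[-2]:
        return -penalty
    return 0.0
```
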
---

## Training

### Method: SFT Warm-Up → GRPO

Training uses a two-phase approach:

**Phase 1 — SFT Warm-Up (20 steps)**
Qwen2.5-0.5B-Instruct does not reliably output valid JSON from a cold start. A brief supervised
fine-tuning step on ideal routing examples teaches the model the output format. Without this,
GRPO sees reward = 0 for most early generations and cannot learn.
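
An illustrative shape for one warm-up example; the exact prompt template is not shown here, so both strings below are hypothetical:

```python
# One hypothetical SFT pair: observation rendered as text, ideal action as JSON.
sft_example = {
    "prompt": (
        "You manage a 12-node supply chain. Pending shipment: 21.5 units at node 2 "
        "(Supplier Port, 87% load, weather disruption). Node 5 (Warehouse Beta) is at "
        "44% load and connected. Respond with a JSON routing action."
    ),
    "completion": (
        '{"reasoning": "Node 2 is disrupted and near capacity; node 5 has buffer and '
        'is a legal neighbour.", "source_node": 2, "dest_node": 5, '
        '"shipment_volume": 21.5}'
    ),
}
```
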
**Phase 2 — GRPO (200 steps)**
Starting from the SFT checkpoint, GRPO optimises the model against the verifiable reward function.
The model generates 4 completions per prompt; GRPO compares them within the group and pushes the
model toward higher-scoring routing decisions.

### Training Stack

```
OpenEnv environment → live rollout prompts → TRL GRPOTrainer
                                             + Unsloth (QLoRA r=16)
                                             + Qwen2.5-0.5B-Instruct
```

| Parameter | Value |
|-----------|-------|
| Base model | `Qwen/Qwen2.5-0.5B-Instruct` |
| Adapter | LoRA r=16, α=32 |
| Optimiser | GRPO via TRL |
| Max steps | 200 |
| Generations per prompt | 4 |
| Learning rate | 5e-6 |
| GPU | T4 (Colab free tier) |
| Total training time | ~45 minutes |
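
A sketch of how the table above could map onto TRL's `GRPOTrainer`. Dataset construction and the Unsloth/QLoRA model setup are elided, and `reward_fn` reuses the `action_reward` sketch from the Reward Design section; exact argument names can vary across TRL versions:

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset; real prompts are rendered from live environment rollouts.
prompt_dataset = Dataset.from_list([
    {"prompt": "Pending shipment: 21.5 units at node 2. Respond with a JSON routing action."},
])

config = GRPOConfig(
    output_dir="outputs/logiflow-grpo-script",
    learning_rate=5e-6,
    max_steps=200,
    num_generations=4,   # completions compared within each GRPO group
)

def reward_fn(completions, **kwargs):
    # Score each completion with the verifiable reward sketched earlier.
    return [action_reward(c, pending_source=2, incoming=21.5, edges={2: [5, 6]})
            for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_fn,
    args=config,
    train_dataset=prompt_dataset,
)
trainer.train()
```
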
---

## Results

### Baseline Policy Comparison

The table below shows three hand-coded baselines evaluated on all three tasks
**before any LLM training**. These are the targets the trained model must beat.

| Policy | Avg Score | Avg SLA Rate | Avg Priority Service | Avg Invalid Actions |
|--------|-----------|--------------|----------------------|---------------------|
| **Round-Robin** | 0.469 | 0.0% | 0.0% | 2.0 |
| **Heuristic** | 0.782 | 100.0% | 6.6% | 3.3 |
| **Resilient** | 0.776 | 100.0% | 4.3% | 3.0 |

**Key insight:** Round-robin achieves a 0% SLA success rate despite reasonable step rewards,
because it ignores delivery deadlines entirely. The heuristic achieves 100% SLA but still
fails on priority service (6.6%) and produces invalid actions under disruption.
The trained GRPO model targets both gaps.

### Per-Task Breakdown

| Task | Round-Robin | Heuristic | Resilient |
|------|-------------|-----------|-----------|
| Easy | 0.473 | 0.768 | 0.761 |
| Medium | 0.472 | 0.763 | 0.752 |
| Hard | 0.461 | 0.814 | 0.814 |

### Training Evidence

The reward curve below shows GRPO training progress. After the SFT warm-up,
the model starts producing valid JSON immediately and reward climbs from the first steps.

![Reward Curve](artifacts/reward_curve.png)
*Figure 1: GRPO training reward over 200 logging steps.*

![Before vs After](artifacts/before_after_comparison.png)
*Figure 2: Policy comparison across all three task difficulties. Green bars = trained model
(after GRPO). Blue bars = base model (before GRPO). Amber bars = heuristic baseline.*

![Metrics Panel](artifacts/metrics_panel.png)
*Figure 3: Detailed metrics breakdown — overall score, SLA rate, retail delivered, invalid
actions, and bottlenecks — for all three policies across all three tasks.*

---

## What the Trained Agent Thinks

Below is an example of the trained Qwen2.5-0.5B model reasoning through a hard-task
disruption scenario at step 14. This is the chain-of-thought the model produces before
taking an action:

```
Situation: Port 2 is at 87% load with an active weather disruption (3 steps remaining).
Warehouse Beta has 44% load and 33% buffer capacity. 21.5 units incoming from Supplier Port.

Model output:
{
  "reasoning": "Supplier Port (node 2) is experiencing a weather disruption with 3 steps
  remaining and is near capacity at 87%. Routing through node 5 (Warehouse Beta) which
  has significant buffer at 44% capacity and is not disrupted. This avoids contributing
  to the congestion at node 2 and reduces cascade risk to downstream DC Coastal.",
  "source_node": 2,
  "dest_node": 5,
  "shipment_volume": 21.5
}
```

The heuristic would route to the nearest available node. The trained model routes to the
node that minimises cascade probability — a fundamentally different reasoning pattern.

---

## Running Locally

### Start the environment server

```bash
git clone https://github.com/Roshan5105labs/crisis-logistics-env.git
cd crisis-logistics-env/crisis_logistics_env
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Test the environment (no LLM required)

```python
from crisis_logistics_env.server.crisis_logistics_env_environment import (
    CrisisLogisticsEnvironment, choose_network_action
)

env = CrisisLogisticsEnvironment()
obs = env.reset(task_id="hard")
while not obs.done:
    obs = env.step(choose_network_action(obs))
print(f"Score: {env.score:.3f}")
```

### Run the trained LLM agent

```bash
# Set your HuggingFace token for Qwen-72B inference
export HF_TOKEN=your_token_here
python inference.py
```

### Reproduce training

```bash
python train_grpo.py \
  --model-name "Qwen/Qwen2.5-0.5B-Instruct" \
  --max-steps 200 \
  --output-dir "outputs/logiflow-grpo-script"
```

Or open the Colab notebook for a one-click reproducible run:
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Roshan5105labs/crisis-logistics-env/blob/main/notebooks/logiflow_grpo_colab.ipynb)

---

## API Endpoints

The environment is served as a FastAPI application and is fully OpenEnv-compliant.

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Returns `{"status": "healthy"}` — judges use this to verify the Space is live |
| `/reset` | POST | Start a new episode. Body: `{"task_id": "easy"}` |
| `/step` | POST | Take one action. Body: `{"action": {"source_node": 2, "dest_node": 5, "shipment_volume": 18.5}}` |
| `/state` | GET | Full internal state (all 12 nodes visible, no partial observability) |
| `/schema` | GET | OpenAPI schema |
| `/web` | GET | Live network visualizer dashboard |
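
For example, a minimal client session against a locally running server (the base URL is whatever host you deploy to):

```python
import requests

BASE = "http://localhost:8000"   # or your Space URL

# Health check used to confirm the deployment is live.
assert requests.get(f"{BASE}/health").json() == {"status": "healthy"}

# Start an episode on the easy task, then take one routing action.
obs = requests.post(f"{BASE}/reset", json={"task_id": "easy"}).json()
result = requests.post(
    f"{BASE}/step",
    json={"action": {"source_node": 2, "dest_node": 5, "shipment_volume": 18.5}},
).json()
print(result)
```
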
---

## Project Structure

```
crisis_logistics_env/
├── models.py                     # Action, Observation, State dataclasses
├── tasks.py                      # Task configs (easy / medium / hard)
├── graders.py                    # 7-component episode grader (0.0–1.0)
├── train_grpo.py                 # Production GRPO training script
├── inference.py                  # LLM agent loop (Qwen-72B via HF router)
├── train_and_evaluate.py         # Baseline policy evaluation
├── gym_env.py                    # gymnasium.Env wrapper
├── client.py                     # HTTP client for server
├── server/
│   ├── app.py                    # FastAPI server (7 endpoints)
│   └── crisis_logistics_env_environment.py  # World simulation engine
├── visualisation/
│   └── logiflow_visualizer.html  # Live dashboard (served at /web)
├── notebooks/
│   └── logiflow_grpo_colab.ipynb # Reproducible training notebook
├── artifacts/
│   ├── benchmark_summary.json    # Baseline policy results
│   ├── reward_curve.png          # GRPO training curve
│   ├── before_after_comparison.png  # Policy comparison chart
│   └── metrics_panel.png         # Detailed metrics breakdown
├── openenv.yaml                  # OpenEnv manifest
└── Dockerfile                    # HuggingFace Space deployment
```

---

## Links

| Resource | Link |
|----------|------|
| 🤗 HuggingFace Space (live environment) | [Add your Space URL] |
| 📓 Colab Training Notebook | [Add your Colab URL] |
| 📝 HuggingFace Blog Post | [Add your blog URL] |
| 🎥 Demo Video | [Add your YouTube URL] |

---

## Why This Matters

Supply chain disruption costs the global economy an estimated **$1.5 trillion annually**.
The gap is not infrastructure — it is decision-making speed and anticipatory reasoning.

An LLM trained on LogiFlow-RL learns to:
- Read congestion signals before they become bottlenecks
- Reason about partial information the way a real logistics manager would
- Anticipate cascade effects from disruptions it cannot directly observe
- Balance competing priorities: throughput, SLA compliance, and network stability

This environment exists to teach LLMs something they currently cannot do well — and to
prove that teaching is measurable.

---

## Citation

```bibtex
@misc{logiflow-rl-2026,
  title = {LogiFlow-RL: Training LLMs for Proactive Supply Chain Crisis Management},
  author = {Your Name},
  year = {2026},
  howpublished = {OpenEnv Hackathon India 2026 — Theme \#2: Long-Horizon Planning},
  url = {https://huggingface.co/spaces/<your-space-url>}
}
```

---

*Submitted to the Meta × PyTorch × OpenEnv × Scaler Hackathon India 2026 — Theme #2: Long-Horizon Planning & Instruction Following*
 
server/app.py CHANGED
@@ -12,7 +12,7 @@ from pathlib import Path
 from typing import Any, Dict, Literal, Optional
 
 from fastapi import FastAPI, HTTPException
-from fastapi.responses import HTMLResponse
+from fastapi.responses import HTMLResponse, RedirectResponse
 from pydantic import BaseModel
 
 try:
@@ -219,6 +219,17 @@ async def web_landing() -> HTMLResponse:
     return HTMLResponse(_read_visualizer_html())
 
 
+@app.get("/web/", response_class=HTMLResponse, tags=["Environment Info"])
+async def web_landing_slash() -> HTMLResponse:
+    return HTMLResponse(_read_visualizer_html())
+
+
+@app.get("/server", include_in_schema=False)
+async def server_compat() -> RedirectResponse:
+    """Compatibility route used by some deployment templates."""
+    return RedirectResponse(url="/web")
+
+
 @app.get("/visualizer", response_class=HTMLResponse, tags=["Environment Info"])
 async def visualizer() -> HTMLResponse:
     return HTMLResponse(_read_visualizer_html())