Upload folder using huggingface_hub

Files changed:
- README.md (+432 −432)
- server/app.py (+12 −1)

README.md (CHANGED)

The upload rewrote README.md in full, but the old and new contents are identical, so the file appears once below.
---
title: LogiFlow-RL
emoji: "⭐"
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
base_path: /web
---

# LogiFlow-RL — Smart Supply Chain Crisis Management

> **Training an LLM to route shipments proactively across a 12-node global supply chain — before disruptions cascade, not after.**

[](https://github.com/meta-pytorch/OpenEnv)
[](https://huggingface.co/spaces/<your-space-url>)
[](https://colab.research.google.com/github/Roshan5105labs/crisis-logistics-env/blob/main/notebooks/logiflow_grpo_colab.ipynb)
[](https://github.com/meta-pytorch/OpenEnv)

---
## The Problem

The 2021 Suez Canal blockage held up **$9 billion of goods every day**. Every major retailer missed
delivery SLAs that quarter. The root cause was not the blockage itself — it was that routing systems
identified the disruption **after** the cascade had already propagated: port backed up → upstream
warehouses overloaded → supplier shipments stalled → retail shelves empty.

Modern logistics software is fundamentally **reactive**. It alerts managers when things have already
gone wrong. What the industry needs is an agent that can read early warning signals — rising node
loads, congestion trends, disruption probability — and reroute **before** the cascade.

This is what LogiFlow-RL trains.

---
## What This Project Does

LogiFlow-RL is an **OpenEnv-compliant reinforcement learning environment** that simulates a
12-node global supply chain operating under stochastic disruptions. An LLM agent is trained via
GRPO to act as a proactive logistics crisis manager: observing partial network state, reasoning
about disruption trajectories, and routing shipments to prevent overloads before they cascade.

```
Suppliers → Warehouses → Distribution Centres → Retail Sinks
    4     →     3      →          3           →      2
```

The environment is **genuinely hard to solve** — a round-robin baseline scores only 0.469 on
average and achieves **0% SLA compliance**, because it cannot see far enough ahead to prioritise
time-sensitive shipments. Even a well-designed heuristic struggles on cascade scenarios.

---
## The Capability Gap Being Targeted

| What LLMs are good at today | What this environment trains |
|---|---|
| One-shot Q&A and summarisation | Multi-step sequential decisions |
| Full information, short context | Partial observability, long horizon |
| Static prompts | Dynamic world state that changes every step |
| Reactive reasoning | Anticipatory planning under uncertainty |

> **Research framing:** Could a researcher write a paper about training on this environment?
> Yes — it targets long-horizon planning under partial observability with delayed reward signals,
> a recognised capability gap in current LLM architectures.

---
## Environment Architecture

### Network Topology (12 nodes, 4 tiers)

```
[Node 0] Supplier North ──┐
[Node 1] Supplier West  ──┼──► [Node 4] Warehouse Alpha ──► [Node 7] DC Metro   ──► [Node 10] Retail North
[Node 2] Supplier Port  ──┼──► [Node 5] Warehouse Beta  ──► [Node 8] DC Central ──►        ↕
[Node 3] Supplier Inland──┘──► [Node 6] Warehouse Gamma ──► [Node 9] DC Coastal ──► [Node 11] Retail South
```

Every node has: **capacity**, **current load**, **drain rate**, **risk score**, and
**typed connections** to downstream nodes. Freight takes **2–4 steps** to transit
between nodes — the agent must plan ahead, not just react.
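To make the per-node record concrete, here is a minimal sketch of the state described above. The field and method names are illustrative assumptions; the project's actual dataclasses live in `models.py` and may differ.

```python
from dataclasses import dataclass, field

@dataclass
class SupplyNode:
    """Illustrative per-node state: capacity, load, drain, risk, typed links."""
    node_id: int
    name: str                  # e.g. "Warehouse Beta"
    capacity: float            # maximum load before the node counts as overloaded
    load: float = 0.0          # current load
    drain_rate: float = 0.0    # load removed per step (processing / outflow)
    risk_score: float = 0.0    # base probability weight for disruptions
    downstream: dict[int, int] = field(default_factory=dict)  # dest_id -> transit steps (2-4)

    def utilisation(self) -> float:
        return self.load / self.capacity if self.capacity else 0.0

    def buffer(self) -> float:
        """Free capacity as a fraction (the '33% buffer' in the examples)."""
        return max(0.0, 1.0 - self.utilisation())
```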
### What Makes This Hard

**1. Partial observability.** The agent sees only nodes within 2 hops of the current
shipment source. Nodes beyond that radius appear as `null` in the observation. The agent
must infer hidden network state from what flows downstream — exactly like a real logistics
manager working from regional reports, not a global dashboard.

**2. Stochastic cascade disruptions.** Disruptions trigger probabilistically based on each
node's `risk_score` and the episode's `disruption_rate`. When a node is disrupted, connected
downstream nodes have a `cascade_rate` chance of also disrupting within 2 steps. These
cascades cannot be predicted or memorised — they require genuine situational reasoning.

**3. Priority-demand windows.** Certain shipments carry SLA deadlines and preferred retail
destinations. Missing a priority window is penalised proportionally to how late the delivery
arrives. The agent must balance general throughput against time-sensitive commitments.

**4. Dynamic pressure feedback.** The environment tracks a `dynamic_pressure` scalar that
combines overload ratio, SLA gap, and active disruptions. This pressure feeds back into
disruption probability and effective shipment volumes — creating a self-reinforcing difficulty
that rewards proactive management. A sketch of how points 2 and 4 could fit together follows below.
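This sketch is an assumed illustration, not the engine's actual formulas; the trigger probability, cascade sampling, and pressure weights are all invented for clarity.

```python
import random

def maybe_disrupt(node, disruption_rate: float, dynamic_pressure: float) -> bool:
    """Point 2/4 sketch: disruption chance grows with node risk and network pressure."""
    p = disruption_rate * node.risk_score * (1.0 + dynamic_pressure)
    return random.random() < min(p, 1.0)

def cascade_targets(node, cascade_rate: float) -> list[int]:
    """Each downstream neighbour of a disrupted node may also fail within 2 steps."""
    return [dest for dest in node.downstream if random.random() < cascade_rate]

def update_pressure(overload_ratio: float, sla_gap: float, active_disruptions: int) -> float:
    """One plausible way to fold the three signals into a single scalar.
    Weights here are made up for illustration."""
    return min(1.0, 0.4 * overload_ratio
                    + 0.4 * sla_gap
                    + 0.2 * min(active_disruptions / 3, 1.0))
```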
### Three Difficulty Tiers

| Task | Title | Steps | Disruption Rate | Cascade Rate | Objective |
|------|-------|-------|-----------------|--------------|-----------|
| **Easy** | Regional Network Balancing | 50 | 0.05 | 0.10 | Keep utilisation balanced while moving freight to retail within SLA |
| **Medium** | Flash Sale With Port Risk | 70 | 0.09 | 0.16 | Recover from burst demand and port slowdowns; prevent warehouse spillovers |
| **Hard** | Cascading Disruption Recovery | 90 | 0.12 | 0.22 | Stabilise a partially observable chain through weather events, supplier failures, and cascade disruptions |
### Action Space

At each step the agent receives a natural language observation and must output:

```json
{
  "reasoning": "Port 2 is trending toward congestion. Warehouse Beta has 33% buffer capacity. Routing via Beta avoids the likely cascade.",
  "source_node": 2,
  "dest_node": 5,
  "shipment_volume": 18.5
}
```

The `reasoning` field is not just cosmetic — it is **required** by the reward function and
is what judges and users actually see when demonstrating the trained model.
### Observation Space

```python
CrisisLogisticsObservation(
    step_count = 14,
    max_steps = 90,
    visible_node_ids = [2, 5, 6, 8, 9],           # 2-hop visibility only
    observed_node_loads = [67.3, None, 44.1, None, 51.7, None, ...],  # None = hidden (null over HTTP)
    node_capacities = [90.0, None, 125.0, ...],
    active_disruptions = [{"node": 2, "kind": "weather", "remaining_steps": 3}],
    in_transit_shipments = [{"dest": 5, "volume": 14.2, "remaining_steps": 2}],
    pending_source_node = 2,
    incoming_load = 21.5,
    dynamic_pressure = 0.38,
    cumulative_score = 0.61,
    last_reward = 0.72,
)
```
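The 2-hop mask is easy to reproduce with a short breadth-first walk. A sketch, assuming connectivity has been flattened into a plain `edges` adjacency dict; the real masking happens inside the environment engine:

```python
from collections import deque

def visible_nodes(source_id: int, edges: dict[int, list[int]], hops: int = 2) -> set[int]:
    """Breadth-first expansion: everything within `hops` of the shipment source is visible."""
    seen, frontier = {source_id}, deque([(source_id, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

def mask_loads(loads: list[float], visible: set[int]) -> list:
    """Loads outside the mask become None, matching observed_node_loads above."""
    return [load if i in visible else None for i, load in enumerate(loads)]
```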
---

## Reward Design

The environment uses a **7-component weighted grader** to prevent reward hacking and
ensure every aspect of logistics performance is measured independently.

### Episode Grader (`graders.py`)

| Component | Weight | What It Measures |
|-----------|--------|------------------|
| Bottleneck avoidance | 18% | How often any node exceeded capacity |
| Network balance | 18% | Average load gap between the most and least loaded nodes |
| Step reward | 14% | Average per-step reward across the episode |
| Retail delivery | 20% | Freight actually delivered to retail nodes vs target |
| SLA compliance | 15% | Deliveries arriving within their deadline window |
| Disruption recovery | 10% | How quickly the network stabilised after each disruption |
| Action validity | 5% | Fraction of legal (connected) routing decisions |
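Combining the components is then a plain weighted sum. A sketch that transcribes the table directly, assuming each component function has already been normalised to [0, 1]; `graders.py` remains the source of truth:

```python
# Weights copied from the table above; they sum to 1.0.
GRADER_WEIGHTS = {
    "bottleneck_avoidance": 0.18,
    "network_balance":      0.18,
    "step_reward":          0.14,
    "retail_delivery":      0.20,
    "sla_compliance":       0.15,
    "disruption_recovery":  0.10,
    "action_validity":      0.05,
}

def episode_score(components: dict[str, float]) -> float:
    """Weighted 7-component episode grade in [0.0, 1.0]."""
    return sum(GRADER_WEIGHTS[name] * components[name] for name in GRADER_WEIGHTS)
```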
### Training Reward (`action_reward` in `train_grpo.py`)

The GRPO training reward is a 5-component verifiable reward:

| Component | Max | What It Checks |
|-----------|-----|----------------|
| Valid JSON | 0.20 | Output is parseable JSON |
| Required keys | 0.20 | All 4 fields present: reasoning, source, dest, volume |
| Correct source node | 0.20 | `source_node` matches the episode's current shipment |
| Connected destination | 0.25 | `dest_node` is a legal neighbour of `source_node` |
| Plausible volume | 0.15 | 0 < `shipment_volume` ≤ 60 and close to incoming load |
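A condensed sketch of those five checks. The real `action_reward` in `train_grpo.py` is the reference implementation; the volume tolerance below is an assumption:

```python
import json

def action_reward(completion: str, expected_source: int,
                  neighbours: set[int], incoming_load: float) -> float:
    """5-component verifiable reward, mirroring the table above (sketch)."""
    try:
        action = json.loads(completion)
    except (json.JSONDecodeError, TypeError):
        return 0.0
    reward = 0.20  # valid JSON

    required = {"reasoning", "source_node", "dest_node", "shipment_volume"}
    if not required.issubset(action):
        return reward
    reward += 0.20  # all four keys present

    if action["source_node"] == expected_source:
        reward += 0.20  # correct source node
    if action["dest_node"] in neighbours:
        reward += 0.25  # legal, connected destination
    vol = action["shipment_volume"]
    if isinstance(vol, (int, float)) and 0 < vol <= 60 and abs(vol - incoming_load) <= 10:
        reward += 0.15  # plausible volume ("close to incoming load": tolerance assumed)
    return reward
```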
### Anti-Gaming Guards

- Reward only counts on **confirmed delivery**, not on dispatch
- **Route-repeat penalty** for consecutive identical routing decisions
- **Risk penalty** for routing through actively disrupted nodes
- **Overload penalty** applied even if JSON format is perfect
- All reward components are **independent** — gaming one does not inflate others

---

## Training

### Method: SFT Warm-Up → GRPO

Training uses a two-phase approach:

**Phase 1 — SFT Warm-Up (20 steps)**
Qwen2.5-0.5B-Instruct does not reliably output valid JSON from a cold start. A brief supervised
fine-tuning phase on ideal routing examples teaches the model the output format; one example pair
is sketched below. Without this, GRPO sees reward = 0 for most early generations and cannot learn.
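For concreteness, one SFT warm-up pair might look like the following. The prompt wording and dataset shape are illustrative, not the project's actual format:

```python
# One illustrative SFT pair: observation prompt -> ideal routing completion.
sft_example = {
    "prompt": (
        "You manage a 12-node supply chain. Pending shipment: 21.5 units at "
        "node 2 (Supplier Port, 87% load, weather disruption, 3 steps left). "
        "Neighbours: node 5 (Warehouse Beta, 44% load). Respond with JSON."
    ),
    "completion": (
        '{"reasoning": "Node 2 is disrupted and near capacity; Warehouse Beta '
        'has buffer and no disruption.", '
        '"source_node": 2, "dest_node": 5, "shipment_volume": 21.5}'
    ),
}
```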
**Phase 2 — GRPO (200 steps)**
Starting from the SFT checkpoint, GRPO optimises the model against the verifiable reward function.
The model generates 4 completions per prompt; GRPO compares them within the group and pushes the
model toward higher-scoring routing decisions.

### Training Stack

```
OpenEnv environment → live rollout prompts → TRL GRPOTrainer
                                             + Unsloth (QLoRA r=16)
                                             + Qwen2.5-0.5B-Instruct
```

| Parameter | Value |
|-----------|-------|
| Base model | `Qwen/Qwen2.5-0.5B-Instruct` |
| Adapter | LoRA r=16, α=32 |
| Optimiser | GRPO via TRL |
| Max steps | 200 |
| Generations per prompt | 4 |
| Learning rate | 5e-6 |
| GPU | T4 (Colab free tier) |
| Total training time | ~45 minutes |
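The hyperparameter table maps onto TRL's `GRPOConfig` roughly as follows. This is a sketch assuming a recent TRL version and reusing the `action_reward` sketch from earlier; `train_grpo.py` is the maintained script:

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset: in the real script the prompts come from live OpenEnv rollouts.
rollout_prompts = Dataset.from_dict({"prompt": ["<observation text from the environment>"]})

def reward_fn(completions, **kwargs):
    # Adapter to TRL's reward-function signature; delegates to the
    # action_reward sketch above. Episode context is hardcoded purely
    # for illustration.
    return [action_reward(c, expected_source=2, neighbours={5}, incoming_load=21.5)
            for c in completions]

config = GRPOConfig(
    output_dir="outputs/logiflow-grpo-script",
    learning_rate=5e-6,    # from the hyperparameter table
    max_steps=200,         # Phase 2 length
    num_generations=4,     # completions compared within each GRPO group
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # LoRA/QLoRA attachment omitted in this sketch
    args=config,
    train_dataset=rollout_prompts,
    reward_funcs=[reward_fn],
)
trainer.train()
```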
---

## Results

### Baseline Policy Comparison

The table below shows three hand-coded baselines evaluated on all three tasks
**before any LLM training**. These are the targets the trained model must beat.

| Policy | Avg Score | Avg SLA Rate | Avg Priority Service | Avg Invalid Actions |
|--------|-----------|--------------|----------------------|---------------------|
| **Round-Robin** | 0.469 | 0.0% | 0.0% | 2.0 |
| **Heuristic** | 0.782 | 100.0% | 6.6% | 3.3 |
| **Resilient** | 0.776 | 100.0% | 4.3% | 3.0 |

**Key insight:** Round-robin achieves a 0% SLA success rate despite reasonable step rewards —
because it ignores delivery deadlines entirely. The heuristic achieves 100% SLA but still
fails on priority service (6.6%) and produces invalid actions under disruption.
The trained GRPO model targets both gaps.

### Per-Task Breakdown

| Task | Round-Robin | Heuristic | Resilient |
|------|-------------|-----------|-----------|
| Easy | 0.473 | 0.768 | 0.761 |
| Medium | 0.472 | 0.763 | 0.752 |
| Hard | 0.461 | 0.814 | 0.814 |

### Training Evidence

The reward curve below shows GRPO training progress. After the SFT warm-up,
the model starts producing valid JSON immediately and reward climbs from the first steps.


*Figure 1: GRPO training reward over 200 logging steps.*


*Figure 2: Policy comparison across all three task difficulties. Green bars = trained model
(after GRPO). Blue bars = base model (before GRPO). Amber bars = heuristic baseline.*


*Figure 3: Detailed metrics breakdown — overall score, SLA rate, retail delivered, invalid
actions, and bottlenecks — for all three policies across all three tasks.*

---
## What the Trained Agent Thinks

Below is an example of the trained Qwen2.5-0.5B model reasoning through a hard-task
disruption scenario at step 14. This is the chain-of-thought the model produces before
taking an action:

```
Situation: Port 2 is at 87% load with an active weather disruption (3 steps remaining).
Warehouse Beta has 44% load and 33% buffer capacity. 21.5 units incoming from Supplier Port.

Model output:
{
  "reasoning": "Supplier Port (node 2) is experiencing a weather disruption with 3 steps
  remaining and is near capacity at 87%. Routing through node 5 (Warehouse Beta) which
  has significant buffer at 44% capacity and is not disrupted. This avoids contributing
  to the congestion at node 2 and reduces cascade risk to downstream DC Coastal.",
  "source_node": 2,
  "dest_node": 5,
  "shipment_volume": 21.5
}
```

The heuristic would route to the nearest available node. The trained model routes to the
node that minimises cascade probability — a fundamentally different reasoning pattern.

---
## Running Locally

### Start the environment server

```bash
git clone https://github.com/Roshan5105labs/crisis-logistics-env.git
cd crisis-logistics-env/crisis_logistics_env
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Test the environment (no LLM required)

```python
from crisis_logistics_env.server.crisis_logistics_env_environment import (
    CrisisLogisticsEnvironment, choose_network_action
)

env = CrisisLogisticsEnvironment()
obs = env.reset(task_id="hard")
while not obs.done:
    obs = env.step(choose_network_action(obs))
print(f"Score: {env.score:.3f}")
```
### Run the trained LLM agent

```bash
# Set your HuggingFace token for Qwen-72B inference
export HF_TOKEN=your_token_here
python inference.py
```

### Reproduce training

```bash
python train_grpo.py \
  --model-name "Qwen/Qwen2.5-0.5B-Instruct" \
  --max-steps 200 \
  --output-dir "outputs/logiflow-grpo-script"
```

Or open the Colab notebook for a one-click reproducible run:
[](https://colab.research.google.com/github/Roshan5105labs/crisis-logistics-env/blob/main/notebooks/logiflow_grpo_colab.ipynb)

---
## API Endpoints

The environment is served as a FastAPI application and is fully OpenEnv-compliant.

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Returns `{"status": "healthy"}` — judges use this to verify the Space is live |
| `/reset` | POST | Start a new episode. Body: `{"task_id": "easy"}` |
| `/step` | POST | Take one action. Body: `{"action": {"source_node": 2, "dest_node": 5, "shipment_volume": 18.5}}` |
| `/state` | GET | Full internal state (all 12 nodes visible, no partial observability) |
| `/schema` | GET | OpenAPI schema |
| `/web` | GET | Live network visualizer dashboard |
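A minimal Python client against these endpoints (a sketch; the repository ships a maintained `client.py`, and the request bodies below are taken from the table):

```python
import requests

BASE = "http://localhost:8000"  # or your Space URL

# Liveness check, as judges would do.
assert requests.get(f"{BASE}/health").json()["status"] == "healthy"

# Start an episode and take one routing step, mirroring the bodies in the table.
obs = requests.post(f"{BASE}/reset", json={"task_id": "easy"}).json()
result = requests.post(
    f"{BASE}/step",
    json={"action": {"source_node": 2, "dest_node": 5, "shipment_volume": 18.5}},
).json()

# Full, unmasked world state (debugging only; the agent never sees this).
state = requests.get(f"{BASE}/state").json()
```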
---

## Project Structure

```
crisis_logistics_env/
├── models.py                    # Action, Observation, State dataclasses
├── tasks.py                     # Task configs (easy / medium / hard)
├── graders.py                   # 7-component episode grader (0.0–1.0)
├── train_grpo.py                # Production GRPO training script
├── inference.py                 # LLM agent loop (Qwen-72B via HF router)
├── train_and_evaluate.py        # Baseline policy evaluation
├── gym_env.py                   # gymnasium.Env wrapper
├── client.py                    # HTTP client for server
├── server/
│   ├── app.py                   # FastAPI server (7 endpoints)
│   └── crisis_logistics_env_environment.py  # World simulation engine
├── visualisation/
│   └── logiflow_visualizer.html # Live dashboard (served at /web)
├── notebooks/
│   └── logiflow_grpo_colab.ipynb  # Reproducible training notebook
├── artifacts/
│   ├── benchmark_summary.json   # Baseline policy results
│   ├── reward_curve.png         # GRPO training curve
│   ├── before_after_comparison.png  # Policy comparison chart
│   └── metrics_panel.png        # Detailed metrics breakdown
├── openenv.yaml                 # OpenEnv manifest
└── Dockerfile                   # HuggingFace Space deployment
```
---

## Links

| Resource | Link |
|----------|------|
| 🤗 HuggingFace Space (live environment) | [Add your Space URL] |
| 📓 Colab Training Notebook | [Add your Colab URL] |
| 📝 HuggingFace Blog Post | [Add your blog URL] |
| 🎥 Demo Video | [Add your YouTube URL] |

---
## Why This Matters

Supply chain disruption costs the global economy an estimated **$1.5 trillion annually**.
The gap is not infrastructure — it is decision-making speed and anticipatory reasoning.

An LLM trained on LogiFlow-RL learns to:
- Read congestion signals before they become bottlenecks
- Reason about partial information the way a real logistics manager would
- Anticipate cascade effects from disruptions it cannot directly observe
- Balance competing priorities: throughput, SLA compliance, and network stability

This environment exists to teach LLMs something they currently cannot do well — and to
prove that the teaching is measurable.

---
## Citation

```bibtex
@misc{logiflow-rl-2026,
  title        = {LogiFlow-RL: Training LLMs for Proactive Supply Chain Crisis Management},
  author       = {Your Name},
  year         = {2026},
  howpublished = {OpenEnv Hackathon India 2026 — Theme \#2: Long-Horizon Planning},
  url          = {https://huggingface.co/spaces/<your-space-url>}
}
```

---

*Submitted to the Meta × PyTorch × OpenEnv × Scaler Hackathon India 2026 — Theme #2: Long-Horizon Planning & Instruction Following*
server/app.py (CHANGED)

```diff
@@ -12,7 +12,7 @@ from pathlib import Path
 from typing import Any, Dict, Literal, Optional
 
 from fastapi import FastAPI, HTTPException
-from fastapi.responses import HTMLResponse
+from fastapi.responses import HTMLResponse, RedirectResponse
 from pydantic import BaseModel
 
 try:
@@ -219,6 +219,17 @@ async def web_landing() -> HTMLResponse:
     return HTMLResponse(_read_visualizer_html())
 
 
+@app.get("/web/", response_class=HTMLResponse, tags=["Environment Info"])
+async def web_landing_slash() -> HTMLResponse:
+    return HTMLResponse(_read_visualizer_html())
+
+
+@app.get("/server", include_in_schema=False)
+async def server_compat() -> RedirectResponse:
+    """Compatibility route used by some deployment templates."""
+    return RedirectResponse(url="/web")
+
+
 @app.get("/visualizer", response_class=HTMLResponse, tags=["Environment Info"])
 async def visualizer() -> HTMLResponse:
     return HTMLResponse(_read_visualizer_html())
```
|