HF mini blog for the project LogiFlow RL Environment
HF_MINI_BLOG.md (added, +272 -0)
---
title: "LogiFlow-RL: Training an LLM to Manage Supply Chain Crises Before They Cascade"
authors:
- user: roshan5105labs
---

# LogiFlow-RL: Training an LLM to Manage Supply Chain Crises Before They Cascade

*A long-horizon planning environment for the Meta × PyTorch × OpenEnv Hackathon India 2026*

---

## The Problem in One Sentence

Most supply chain systems today are **reactive**: they tell you something went wrong
after freight is already delayed. We trained an LLM to act **proactively**, rerouting
shipments before disruptions cascade across a 12-node global network.

---

## Why This Is Hard

The 2021 Suez Canal blockage held up an estimated $9 billion of goods every single day. The real damage
was not the canal itself. It was that logistics managers found out about downstream effects
**after** they had already propagated. Port backed up → warehouses overloaded →
suppliers stalled → shelves empty. A chain reaction that took days to untangle.

The fundamental question LogiFlow-RL asks is: **can an LLM learn to anticipate that chain
before it starts?**

This is genuinely hard because it combines several things current LLMs are weak at:

- **Multi-step sequential decisions** instead of one-shot answers
- **Partial observability**: the agent only sees 2 hops of a 12-node network
- **Delayed consequences**: shipments take 2–4 steps to arrive after dispatch
- **Stochastic disruptions** that cascade unpredictably from node to node

---

## The Environment

We built a 12-node supply chain simulation using [OpenEnv](https://github.com/meta-pytorch/OpenEnv),
structured as four tiers:

```
Suppliers (4) → Warehouses (3) → Distribution Centres (3) → Retail (2)
```

At every step the agent receives a natural language observation describing what it can
see: **only the nodes within 2 hops of the current shipment**. Everything beyond that
is hidden. The agent must infer the state of upstream suppliers from what flows
downstream, much like a real regional logistics manager working from partial reports.

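The two-hop visibility rule can be illustrated with a breadth-first search over the tiered network. This is a minimal sketch, not the environment's actual code: the node numbering (0 to 11, tier by tier), the assumption that every node links to every node in the adjacent tier, and the helper names `build_adjacency`, `visible_nodes`, and `mask_loads` are all hypothetical. Under these assumptions, a shipment sitting at retail node 10 can see the distribution centres and warehouses, but none of the suppliers.

```python
from collections import deque

# Assumed layout (not from the original post): nodes are numbered tier by
# tier, and every node is linked to every node in the adjacent tier.
TIERS = {
    "supplier": [0, 1, 2, 3],
    "warehouse": [4, 5, 6],
    "distribution_centre": [7, 8, 9],
    "retail": [10, 11],
}


def build_adjacency(tiers):
    """Return an undirected adjacency map linking each tier to the next one."""
    order = list(tiers.values())
    adjacency = {node: set() for tier in order for node in tier}
    for upstream, downstream in zip(order, order[1:]):
        for u in upstream:
            for d in downstream:
                adjacency[u].add(d)
                adjacency[d].add(u)
    return adjacency


def visible_nodes(adjacency, start, max_hops=2):
    """Breadth-first search that stops expanding once max_hops is reached."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # do not look past the visibility horizon
        for neighbour in adjacency[node]:
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    return sorted(depth)


def mask_loads(loads, visible):
    """Replace the load of every hidden node with None (null in the prompt)."""
    return [load if i in visible else None for i, load in enumerate(loads)]


ADJACENCY = build_adjacency(TIERS)
visible_from_retail = visible_nodes(ADJACENCY, start=10)
print(visible_from_retail)  # suppliers 0-3 are outside the two-hop horizon
```
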
The environment has three difficulty levels:

| Task | Steps | Disruption Rate | What Makes It Hard |
|------|-------|-----------------|--------------------|
| **Easy**: Regional Balancing | 50 | 5% | Keep loads balanced across a quiet network |
| **Medium**: Flash Sale Surge | 70 | 9% | Absorb burst demand without warehouse spills |
| **Hard**: Cascading Disruption | 90 | 12% | Stabilise through weather events, supplier failures, cascade chains |

What makes this a **long-horizon** problem is that a bad routing decision at step 10 may
not show up as a penalty until step 14, when the shipment finally arrives at an already
overloaded node and causes a cascade. The agent must plan ahead, not just react.

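The delay between dispatch and penalty can be made concrete with a toy in-transit queue. The sketch below is an illustration only: the `Shipment` class, the `advance` and `step_of_first_overload` helpers, the capacity of 100, and the starting load of 90 are hypothetical and not taken from the environment. It shows a shipment dispatched at step 10 with a 4-step transit time triggering an overload only at step 14.

```python
from dataclasses import dataclass


@dataclass
class Shipment:
    dest: int
    volume: float
    remaining_steps: int


def advance(in_transit, loads, capacity):
    """Move each shipment one step closer and deliver the ones that arrive.

    Returns the shipments still travelling and the nodes that this step's
    arrivals pushed over capacity.
    """
    still_travelling, overloaded = [], []
    for shipment in in_transit:
        shipment.remaining_steps -= 1
        if shipment.remaining_steps > 0:
            still_travelling.append(shipment)
            continue
        loads[shipment.dest] += shipment.volume
        if loads[shipment.dest] > capacity:
            overloaded.append(shipment.dest)
    return still_travelling, overloaded


def step_of_first_overload(shipment, loads, capacity, dispatch_step):
    """Return the step at which the shipment causes an overload, or None."""
    in_transit = [shipment]
    step = dispatch_step
    while in_transit:
        step += 1
        in_transit, overloaded = advance(in_transit, loads, capacity)
        if overloaded:
            return step
    return None


# Hypothetical numbers: node 5 is already at 90 out of an assumed capacity of 100.
loads = {5: 90.0}
late_penalty_step = step_of_first_overload(
    Shipment(dest=5, volume=21.5, remaining_steps=4),
    loads,
    capacity=100.0,
    dispatch_step=10,
)
print(late_penalty_step)
```
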

---

## What the Agent Sees and Does

Every step, the agent receives a prompt like this:

```
Task: Cascading Disruption Recovery
Step: 14/90
Visible nodes: [2, 5, 6, 8, 9]
Node loads: [67.3, null, 44.1, null, 51.7, null, ...]
Active disruptions: [{"node": 2, "kind": "weather", "remaining_steps": 3}]
In-transit: [{"dest": 5, "volume": 14.2, "remaining_steps": 2}]
Incoming: source=2, vol=21.5
Return JSON with: reasoning, source_node, dest_node, shipment_volume
```

The `null` values are hidden nodes that the agent cannot see. It must reason from
what it can observe. The output is a structured JSON object with an explicit reasoning trace:

```json
{
  "reasoning": "Port 2 is at 87% load with a weather disruption lasting 3 more steps. Warehouse Beta (node 5) has 44% load and significant buffer capacity. Routing here avoids contributing to the node 2 congestion and reduces the probability of a cascade to DC Coastal downstream.",
  "source_node": 2,
  "dest_node": 5,
  "shipment_volume": 21.5
}
```

The reasoning field is not cosmetic: it is part of what we evaluate, and it is what
shows whether the model actually understood the situation or guessed.

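Before an action like this can be applied, the completion has to be parsed and checked. The following is a rough sketch of such a validator, assuming the four keys named in the prompt and node indices 0 to 11; the function name `parse_action` and the specific rejection rules are assumptions, not the environment's API. Note that a strict JSON parser rejects raw line breaks inside a string, so the reasoning text must stay on one line or use `\n` escapes.

```python
import json

REQUIRED_KEYS = {"reasoning", "source_node", "dest_node", "shipment_volume"}


def parse_action(text, num_nodes=12):
    """Parse a model completion into an action dict, or return None if unusable."""
    try:
        action = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict) or not REQUIRED_KEYS <= action.keys():
        return None

    src = action["source_node"]
    dst = action["dest_node"]
    vol = action["shipment_volume"]

    # bool is a subclass of int in Python, so it is rejected explicitly.
    for node in (src, dst):
        if isinstance(node, bool) or not isinstance(node, int):
            return None
        if not 0 <= node < num_nodes:
            return None
    if isinstance(vol, bool) or not isinstance(vol, (int, float)) or vol <= 0:
        return None
    if src == dst:
        return None  # a shipment routed to its own source goes nowhere
    return action


valid = parse_action(
    '{"reasoning": "avoid node 2", "source_node": 2, "dest_node": 5, "shipment_volume": 21.5}'
)
free_text = parse_action(
    "I would route the shipment to node 4 or maybe node 6 depending on load."
)
```
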

---

## Reward Design

The environment uses a **7-component grader** so that optimising a single metric is not enough to score well:

| Component | Weight |
|-----------|--------|
| Bottleneck avoidance | 12% |
| Network load balance | 10% |
| Step reward quality | 10% |
| Retail delivery volume | 32% |
| SLA deadline compliance | 20% |
| Disruption recovery speed | 10% |
| Action validity | 6% |

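One plausible reading of the table is a weighted sum of per-component scores. The sketch below assumes each component is already normalised to the range 0 to 1 and that the components are combined linearly; both are assumptions, as are the dictionary keys and the `episode_score` name. The penalty channels are not modelled here.

```python
# Weights copied from the table above; the key names are illustrative.
WEIGHTS = {
    "bottleneck_avoidance": 0.12,
    "network_load_balance": 0.10,
    "step_reward_quality": 0.10,
    "retail_delivery_volume": 0.32,
    "sla_deadline_compliance": 0.20,
    "disruption_recovery_speed": 0.10,
    "action_validity": 0.06,
}


def episode_score(components):
    """Weighted sum of component scores, each clamped to the range [0, 1]."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(
        weight * min(1.0, max(0.0, components[name]))
        for name, weight in WEIGHTS.items()
    )


# A policy that only maximises retail delivery cannot exceed 0.32 overall.
delivery_only = dict.fromkeys(WEIGHTS, 0.0)
delivery_only["retail_delivery_volume"] = 1.0
delivery_only_score = episode_score(delivery_only)
```
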
We also use 7 anti-gaming penalty channels, including overload penalties, route-loop detection,
risk penalties for routing through disrupted nodes, and delivery-only credit
(dispatch alone earns nothing). The reward function was validated before every training
run using a sanity check that asserts good outputs score at least 0.65 points higher
than garbage outputs.

---

## Training

We used a **two-phase approach** to solve the 0-reward cold-start problem:

**Why two phases?** Qwen2.5-0.5B-Instruct does not reliably produce valid JSON
from a cold start. If the model never outputs parseable JSON, every GRPO reward is 0,
the group-relative advantages are all zero, and nothing is learned. A 20-step SFT warm-up on ideal routing
examples teaches the output format first. Then GRPO refines *which routing decisions
are correct* within that format.

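The cold-start failure follows from how GRPO scores completions relative to the rest of their group. Below is a simplified sketch of that normalisation using only the standard library; it uses the population standard deviation and a small `eps`, which may differ in detail from what TRL does internally. When every reward in the group is 0, every advantage is 0 as well, so there is nothing to reinforce.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Standardise rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(reward - mu) / (sigma + eps) for reward in rewards]


# Cold start: no completion parses, so all rewards and all advantages are 0.
cold_start = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# After a warm-up some completions parse, and the group carries a signal.
warmed_up = group_relative_advantages([0.0, 0.0, 1.0, 1.0])
```
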
**Phase 1: SFT warm-up (20 steps)**
Supervised fine-tuning on environment-derived prompt-completion pairs. Teaches JSON
format and basic routing structure. Takes ~5 minutes on a T4 GPU.

**Phase 2: GRPO training (200 steps)**
Starting from the SFT checkpoint, TRL GRPOTrainer with LoRA (r=16) optimises against
a 5-component verifiable reward: JSON validity + required keys + correct source +
valid destination + plausible volume.

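A per-completion version of the five checks might look like the sketch below. It is an assumption-laden illustration: the equal weighting of the checks, the `format_reward` name, the list of legal destinations, and the rule that a plausible volume lies between 0 and the incoming volume are all hypothetical. The final lines mirror the sanity check described under Reward Design, asserting that a good output beats a garbage one by at least 0.65. To use something like this with a trainer, it would still need to be wrapped in whatever batch-level reward callable the trainer expects.

```python
import json

REQUIRED_KEYS = {"reasoning", "source_node", "dest_node", "shipment_volume"}


def format_reward(completion, expected_source, legal_dests, incoming_volume):
    """Score one completion in [0, 1] as the fraction of five checks passed.

    Equal weighting of the checks is an assumption made for this sketch.
    """
    try:
        action = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(action, dict):
        return 0.0

    volume = action.get("shipment_volume")
    is_number = isinstance(volume, (int, float)) and not isinstance(volume, bool)
    checks = [
        True,  # 1. JSON validity (parsing succeeded)
        REQUIRED_KEYS <= action.keys(),  # 2. required keys present
        action.get("source_node") == expected_source,  # 3. correct source
        action.get("dest_node") in list(legal_dests),  # 4. valid destination
        is_number and 0 < volume <= incoming_volume,  # 5. plausible volume
    ]
    return sum(checks) / len(checks)


# Sanity check: a good output must beat a garbage output by at least 0.65.
good = '{"reasoning": "avoid node 2", "source_node": 2, "dest_node": 5, "shipment_volume": 21.5}'
garbage = "I would route the shipment to node 4 or maybe node 6 depending on load."
margin = format_reward(good, 2, [4, 5, 6], 21.5) - format_reward(garbage, 2, [4, 5, 6], 21.5)
assert margin >= 0.65
```
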
```
Stack: OpenEnv → TRL GRPOTrainer → Unsloth / QLoRA → Qwen2.5-0.5B-Instruct
Total runtime: ~45 minutes on Colab T4 GPU
```

---

## Results

### Baseline Policies (before any LLM training)

We evaluated three hand-coded baselines to establish the performance bar:

| Policy | Avg Score | SLA Rate | Priority Service | Invalid Actions |
|--------|-----------|----------|------------------|-----------------|
| **Round-Robin** | 0.469 | **0%** | 0% | 2.0 |
| **Heuristic** | 0.782 | **100%** | 6.6% | 3.3 |
| **Resilient** | 0.776 | **100%** | 4.3% | 3.0 |

**The critical finding:** Round-robin scores 0.469 overall but reaches **0% SLA compliance**.
It routes shipments efficiently enough to earn step rewards, but completely ignores
delivery deadlines. An LLM that only learns to score 0.469 has learned nothing
about the actual task.

Heuristic achieves 100% SLA but still fails at priority service (6.6%) and breaks
down under cascading disruptions. That gap, **between reactive compliance and
proactive crisis management**, is exactly what GRPO training targets.

### What Changed After Training

The most important evidence is qualitative, not quantitative.

**Before training**, the base model given the above prompt responds with:

```
I would route the shipment to node 4 or maybe node 6 depending on load.
```

No JSON. No explicit reasoning about the disruption. Unusable as an action.

**After training**, the same model responds with:

```json
{
  "reasoning": "Node 2 has a weather disruption with 3 steps remaining and is near capacity. Node 5 (Warehouse Beta) has 44% load and buffer capacity. Routing via node 5 avoids the congestion and reduces cascade risk to DC Coastal.",
  "source_node": 2,
  "dest_node": 5,
  "shipment_volume": 21.5
}
```

In this example the trained model reasons about disruption state, infers cascade risk,
and produces a valid routing decision. That is the capability this
environment was built to train.

### Training Progress

The reward curve below shows the model improving from the first steps after the SFT
warm-up. Because the model already knows the JSON format, reward is non-zero from
step 1 and climbs steadily.

![Training Reward Curve](https://raw.githubusercontent.com/Roshan5105labs/crisis-logistics-env/main/artifacts/submission/plots/training_reward_curve.png)
*GRPO training reward over 200 logging steps. After SFT warm-up,
the model starts producing valid structured actions immediately.*

### LLM Training Evidence

The most concrete evidence of learning from GRPO training is the
**invalid action reduction on the Hard task: 24 → 7 (71% reduction)**,
which indicates that the model learned the legal route topology of the network
even under cascading disruption pressure.

Overall episode score improvement is modest at this compute scale
(0.5B model, 200 GRPO steps, free T4 GPU). This environment is
intentionally hard enough that closing the full capability gap likely requires a
larger model. The reward curve confirms a non-zero learning signal
from the first step, which is the direct result of the SFT warm-up
solving the cold-start problem.

### Before vs After GRPO

![Before vs After GRPO](https://raw.githubusercontent.com/Roshan5105labs/crisis-logistics-env/main/artifacts/submission/plots/before_after_grpo_scores.png)
*Policy comparison across all three task difficulties.
Green = trained model. Blue = base model. Amber = heuristic baseline.*

---

## Live Demo

The environment is hosted as a Hugging Face Space and you can interact with it directly:

- 🔗 **Space:** https://roshan5emerald-logiflow-rl.hf.space/
- 📓 **Colab notebook:** https://colab.research.google.com/drive/1wGXYNNYp13emNE1ThX3aqpIM3ppcU_Ty?usp=sharing (Runtime → Run all to reproduce training)
- 💻 **GitHub:** https://github.com/Roshan5105labs/crisis-logistics-env

The Space exposes a live network visualizer at `/web` where you can watch
routing decisions play out across the 12-node diagram in real time.

---

## Try It Yourself

The full training run reproduces in ~45 minutes on a free Colab T4 GPU.
Open the [Colab notebook](https://colab.research.google.com/drive/1wGXYNNYp13emNE1ThX3aqpIM3ppcU_Ty?usp=sharing),
hit Runtime → Run all, and watch the reward curve build live.

For local setup and API details, see the
[GitHub README](https://github.com/Roshan5105labs/crisis-logistics-env).

---

## Why It Matters

Supply chain disruption costs the global economy an estimated **$1.5 trillion annually**.
An LLM trained on LogiFlow-RL learns to anticipate cascade effects, reason under partial
information, and make proactive routing decisions. These are not narrow logistics skills:
they are general planning capabilities that could transfer to other domains where sequential
decisions have delayed consequences.

This environment exists to make that capability measurable, trainable, and improvable.

---

*Built for the Meta × PyTorch × OpenEnv × Scaler Hackathon India 2026, Theme #2: Long-Horizon Planning & Instruction Following*

*[GitHub](https://github.com/Roshan5105labs/crisis-logistics-env) · [HF Space](https://roshan5emerald-logiflow-rl.hf.space/) · [Colab](https://colab.research.google.com/drive/1wGXYNNYp13emNE1ThX3aqpIM3ppcU_Ty?usp=sharing)*