roshan5emerald committed on
Commit
2cce423
·
verified ·
1 Parent(s): a4b6177

HF mini blog for the project LogiFlow RL Environment

  1. HF_MINI_BLOG.md +272 -0
HF_MINI_BLOG.md ADDED
@@ -0,0 +1,272 @@
1
+ ---
2
+ title: "LogiFlow-RL: Training an LLM to Manage Supply Chain Crises Before They Cascade"
3
+ authors:
4
+ - user: roshan5105labs
5
+ ---
6
+
7
+ # LogiFlow-RL: Training an LLM to Manage Supply Chain Crises Before They Cascade
8
+
9
+ *A long-horizon planning environment for the Meta × PyTorch × OpenEnv Hackathon India 2026*
10
+
11
+ ---
12
+
13
+ ## The Problem in One Sentence
14
+
15
+ Most supply chain systems today are **reactive**: they tell you something went wrong
16
+ after freight is already delayed. We trained an LLM to act **proactively**, rerouting
17
+ shipments before disruptions cascade across a 12-node global network.
18
+
19
+ ---
20
+
21
+ ## Why This Is Hard
22
+
23
+ The 2021 Suez Canal blockage held up an estimated $9 billion of goods per day. The real damage
24
+ was not the canal itself — it was that logistics managers found out about downstream effects
25
+ **after** they had already propagated. Port backed up → warehouses overloaded →
26
+ suppliers stalled → shelves empty. A chain reaction that took days to untangle.
27
+
28
+ The fundamental question LogiFlow-RL asks is: **can an LLM learn to anticipate that chain
29
+ before it starts?**
30
+
31
+ This is genuinely hard because it combines several things current LLMs handle poorly:
32
+
33
+ - **Multi-step sequential decisions** instead of one-shot answers
34
+ - **Partial observability**: the agent sees only the nodes within 2 hops of the current shipment in a 12-node network
35
+ - **Delayed consequences** — shipments take 2–4 steps to arrive after dispatch
36
+ - **Stochastic disruptions** that cascade unpredictably from node to node
37
+
38
+ ---
39
+
40
+ ## The Environment
41
+
42
+ We built a 12-node supply chain simulation using [OpenEnv](https://github.com/meta-pytorch/OpenEnv),
43
+ structured as four tiers:
44
+
45
+ ```
46
+ Suppliers (4) → Warehouses (3) → Distribution Centres (3) → Retail (2)
47
+ ```
48
+
49
+ At every step the agent receives a natural language observation describing what it can
50
+ see — **only the nodes within 2 hops of the current shipment**. Everything beyond that
51
+ is hidden. The agent must infer the state of upstream suppliers from what flows
52
+ downstream, exactly like a real regional logistics manager working from partial reports.
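As a rough illustration of the 2-hop visibility rule, here is a minimal sketch. The node numbering (suppliers 0-3, warehouses 4-6, distribution centres 7-9, retail 10-11), the edge list, and the choice to follow only downstream edges are all assumptions made for this example, not the environment's actual topology. The edges were picked so that node 2's view matches the `Visible nodes` list in the sample prompt later in this post.

```python
from collections import deque

# Hypothetical downstream edges for a 4-tier, 12-node network.
# Suppliers 0-3, warehouses 4-6, distribution centres 7-9, retail 10-11.
EDGES = {
    0: [4], 1: [4, 5], 2: [5, 6], 3: [6],
    4: [7, 8], 5: [8, 9], 6: [9],
    7: [10], 8: [10, 11], 9: [11],
    10: [], 11: [],
}


def visible_nodes(start, max_hops=2):
    """Breadth-first search that stops expanding once max_hops edges are used."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # nodes at the hop limit are visible but not expanded
        for nxt in EDGES[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return sorted(seen)


def mask_loads(loads, visible):
    """Replace the load of every hidden node with None (null in the prompt)."""
    return [load if i in visible else None for i, load in enumerate(loads)]


print(visible_nodes(2))
```

Under these assumed edges, the agent at node 2 sees two warehouses and two distribution centres, and every other entry in the load vector is masked.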
53
+
54
+ The environment has three difficulty levels:
55
+
56
+ | Task | Steps | Disruption Rate | What Makes It Hard |
57
+ |------|-------|----------------|-------------------|
58
+ | **Easy** — Regional Balancing | 50 | 5% | Keep loads balanced across quiet network |
59
+ | **Medium** — Flash Sale Surge | 70 | 9% | Absorb burst demand without warehouse spills |
60
+ | **Hard** — Cascading Disruption | 90 | 12% | Stabilise through weather events, supplier failures, cascade chains |
61
+
62
+ What makes this a **long-horizon** problem is that a bad routing decision at step 10 does
63
+ not show up as a penalty until step 14 — when the shipment finally arrives at an already
64
+ overloaded node and causes a cascade. The agent must plan ahead, not just react.
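The delay between dispatch and consequence can be sketched with a small in-transit queue. This is an illustrative model only: the field names follow the `In-transit` entries in the sample prompt, while the `advance` helper, the starting load of 80.0, and the 4-step transit time are assumptions chosen to mirror the step 10 to step 14 example above.

```python
def advance(in_transit, loads):
    """Move every shipment one step closer and deliver the ones that arrive."""
    still_moving = []
    for shipment in in_transit:
        remaining = shipment["remaining_steps"] - 1
        if remaining <= 0:
            # The load at the destination only changes on arrival.
            loads[shipment["dest"]] += shipment["volume"]
        else:
            still_moving.append({**shipment, "remaining_steps": remaining})
    return still_moving


loads = {5: 80.0}  # hypothetical load at the destination node
# Shipment dispatched at step 10 with an assumed 4-step transit time.
in_transit = [{"dest": 5, "volume": 14.2, "remaining_steps": 4}]

history = []
for step in range(11, 15):
    in_transit = advance(in_transit, loads)
    history.append((step, loads[5]))

for step, load in history:
    print(step, round(load, 1))
```

The destination load stays flat for steps 11 to 13 and only jumps at step 14, which is why a per-step reward alone cannot tell the agent that the step-10 decision was the mistake.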
65
+
66
+ ---
67
+
68
+ ## What the Agent Sees and Does
69
+
70
+ Every step, the agent receives a prompt like this:
71
+
72
+ ```
73
+ Task: Cascading Disruption Recovery
74
+ Step: 14/90
75
+ Visible nodes: [2, 5, 6, 8, 9]
76
+ Node loads: [67.3, null, 44.1, null, 51.7, null, ...]
77
+ Active disruptions: [{"node": 2, "kind": "weather", "remaining_steps": 3}]
78
+ In-transit: [{"dest": 5, "volume": 14.2, "remaining_steps": 2}]
79
+ Incoming: source=2, vol=21.5
80
+ Return JSON with: reasoning, source_node, dest_node, shipment_volume
81
+ ```
82
+
83
+ The `null` values are hidden nodes that the agent cannot see. It must reason from
84
+ what it can observe. The output is a structured JSON object with an explicit reasoning trace:
85
+
86
+ ```json
87
+ {
88
+ "reasoning": "Port 2 is at 87% load with a weather disruption lasting 3 more steps.
89
+ Warehouse Beta (node 5) has 44% load and significant buffer capacity.
90
+ Routing here avoids contributing to the node 2 congestion and reduces
91
+ the probability of a cascade to DC Coastal downstream.",
92
+ "source_node": 2,
93
+ "dest_node": 5,
94
+ "shipment_volume": 21.5
95
+ }
96
+ ```
97
+
98
+ The reasoning field is not cosmetic — it is part of what we evaluate, and it is what
99
+ shows whether the model actually understood the situation or guessed.
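Before an action can be scored it has to be parsed. The sketch below shows one way to do that with the standard library; `parse_action` and `REQUIRED_KEYS` are hypothetical names for this example, and the environment's real validation may be stricter. `strict=False` is used so that a reasoning string wrapped across several lines, as in the example above, still parses.

```python
import json

# Keys requested by the prompt ("Return JSON with: ...").
REQUIRED_KEYS = ("reasoning", "source_node", "dest_node", "shipment_volume")


def parse_action(text):
    """Return the action dict, or None if the reply is not a usable action."""
    try:
        # strict=False tolerates raw newlines inside JSON strings.
        action = json.loads(text, strict=False)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    if any(key not in action for key in REQUIRED_KEYS):
        return None
    return action


good = '{"reasoning": "avoid node 2", "source_node": 2, "dest_node": 5, "shipment_volume": 21.5}'
bad = "I would route the shipment to node 4 or maybe node 6 depending on load."

print(parse_action(good) is not None)
print(parse_action(bad) is None)
```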
100
+
101
+ ---
102
+
103
+ ## Reward Design
104
+
105
+ The environment uses a **7-component grader** so no single metric can be gamed:
106
+
107
+ | Component | Weight |
108
+ |-----------|--------|
109
+ | Bottleneck avoidance | 12% |
110
+ | Network load balance | 10% |
111
+ | Step reward quality | 10% |
112
+ | Retail delivery volume | 32% |
113
+ | SLA deadline compliance | 20% |
114
+ | Disruption recovery speed | 10% |
115
+ | Action validity | 6% |
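
To make the weighting concrete, here is a minimal sketch of how the seven components could be combined. It assumes each component is already normalised to the range 0 to 1 and that the grader takes a plain weighted sum; the dictionary keys and the `episode_score` function are hypothetical names, and only the weights come from the table above.

```python
# Weights from the table above, expressed as fractions of 1.
WEIGHTS = {
    "bottleneck_avoidance": 0.12,
    "load_balance": 0.10,
    "step_reward_quality": 0.10,
    "retail_delivery": 0.32,
    "sla_compliance": 0.20,
    "recovery_speed": 0.10,
    "action_validity": 0.06,
}

# The seven weights should account for the whole score.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9


def episode_score(components):
    """Weighted sum of component scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# A policy that is perfect on everything except SLA compliance.
no_sla = {name: 1.0 for name in WEIGHTS}
no_sla["sla_compliance"] = 0.0
print(round(episode_score(no_sla), 2))
```

Under these assumptions, delivery volume and SLA compliance together carry just over half of the score, so a policy cannot rank well by keeping loads balanced while missing deadlines.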
116
+
117
+ We also use 7 anti-gaming penalty channels, including overload penalties, route-loop detection,
118
+ risk penalties for routing through disrupted nodes, and delivery-only credit
119
+ (dispatch alone earns nothing). The reward function was validated before every training
120
+ run using a sanity check that asserts good outputs score at least 0.65 points higher
121
+ than garbage outputs.
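A sanity check of this kind could look like the sketch below. The `check_reward_gap` helper and the stand-in `toy_reward` function are hypothetical, and comparing the worst good output against the best garbage output is an assumption; only the 0.65 threshold comes from the text above.

```python
def check_reward_gap(reward_fn, good_outputs, garbage_outputs, min_gap=0.65):
    """Fail loudly if garbage outputs score too close to good ones."""
    worst_good = min(reward_fn(output) for output in good_outputs)
    best_garbage = max(reward_fn(output) for output in garbage_outputs)
    gap = worst_good - best_garbage
    assert gap >= min_gap, f"reward gap {gap:.2f} is below {min_gap}"
    return gap


def toy_reward(output):
    """Stand-in reward for the demo: 1.0 if the reply looks like JSON, else 0.0."""
    return 1.0 if output.strip().startswith("{") else 0.0


gap = check_reward_gap(
    toy_reward,
    good_outputs=['{"source_node": 2, "dest_node": 5}'],
    garbage_outputs=["maybe node 4?", ""],
)
print(gap)
```

Running a check like this before training catches a reward function that cannot separate good actions from noise, which would otherwise only show up as a flat reward curve.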
122
+
123
+ ---
124
+
125
+ ## Training
126
+
127
+ We used a **two-phase approach** to solve the 0-reward cold-start problem:
128
+
129
+ **Why two phases?** Qwen2.5-0.5B-Instruct does not reliably produce valid JSON
130
+ from a cold start. If the model never outputs parseable JSON, every GRPO reward is 0,
131
+ every group-relative advantage is 0, and nothing is learned. A 20-step SFT warm-up on ideal routing
132
+ examples teaches the output format first. Then GRPO refines *which routing decisions
133
+ are correct* within that format.
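The cold-start argument can be checked with a few lines of arithmetic. GRPO scores each completion relative to the other completions sampled for the same prompt, so a group in which every reward is identical carries no learning signal. The sketch below uses a simple mean and standard deviation normalisation; the `group_advantages` name and the `eps` value are assumptions, and a given trainer's exact normalisation may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Standardise rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Cold start: no completion parses, so every reward is 0.
cold = group_advantages([0.0, 0.0, 0.0, 0.0])

# After warm-up: some completions parse and earn reward.
warm = group_advantages([0.0, 0.0, 1.0, 1.0])

print(cold)
print([round(a, 2) for a in warm])
```

In the cold case every advantage is exactly zero, so the policy gradient is zero too. Once even some completions earn reward, the group splits into positive and negative advantages and the update has a direction.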
134
+
135
+ **Phase 1 — SFT warm-up (20 steps)**
136
+ Supervised fine-tuning on environment-derived prompt-completion pairs. Teaches JSON
137
+ format and basic routing structure. Takes ~5 minutes on a T4 GPU.
138
+
139
+ **Phase 2 — GRPO training (200 steps)**
140
+ Starting from the SFT checkpoint, TRL GRPOTrainer with LoRA (r=16) optimises against
141
+ a 5-component verifiable reward: JSON validity + required keys + correct source +
142
+ valid destination + plausible volume.
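A reward with these five checks could be sketched as follows. The equal weighting of 0.2 per check, the `format_reward` name, and its parameters (`expected_source`, `valid_dests`, `max_volume`) are assumptions for illustration; the actual reward used in training may weight or define the checks differently.

```python
import json

KEYS = ("reasoning", "source_node", "dest_node", "shipment_volume")


def format_reward(completion, expected_source, valid_dests, max_volume):
    """Score a completion on five checks, 0.2 each (assumed equal weights)."""
    try:
        action = json.loads(completion, strict=False)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(action, dict):
        return 0.0

    score = 0.2  # check 1: valid JSON object
    if all(key in action for key in KEYS):
        score += 0.2  # check 2: required keys present
    if action.get("source_node") == expected_source:
        score += 0.2  # check 3: correct source
    dest = action.get("dest_node")
    if isinstance(dest, int) and dest in valid_dests:
        score += 0.2  # check 4: destination is a legal next hop
    volume = action.get("shipment_volume")
    is_number = isinstance(volume, (int, float)) and not isinstance(volume, bool)
    if is_number and 0 < volume <= max_volume:
        score += 0.2  # check 5: plausible volume
    return round(score, 2)


reply = '{"reasoning": "avoid node 2", "source_node": 2, "dest_node": 5, "shipment_volume": 21.5}'
print(format_reward(reply, expected_source=2, valid_dests={5, 6}, max_volume=21.5))
```

Because every check can be verified from the prompt alone, a reward of this shape needs no environment rollout during training, which keeps each GRPO step cheap.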
143
+
144
+ ```
145
+ Stack: OpenEnv → TRL GRPOTrainer → Unsloth / QLoRA → Qwen2.5-0.5B-Instruct
146
+ Total runtime: ~45 minutes on Colab T4 GPU
147
+ ```
148
+
149
+ ---
150
+
151
+ ## Results
152
+
153
+ ### Baseline Policies (before any LLM training)
154
+
155
+ We evaluated three hand-coded baselines to establish the performance bar:
156
+
157
+ | Policy | Avg Score | SLA Rate | Priority Service | Invalid Actions |
158
+ |--------|-----------|----------|-----------------|----------------|
159
+ | **Round-Robin** | 0.469 | **0%** | 0% | 2.0 |
160
+ | **Heuristic** | 0.782 | **100%** | 6.6% | 3.3 |
161
+ | **Resilient** | 0.776 | **100%** | 4.3% | 3.0 |
162
+
163
+ **The critical finding:** Round-robin scores 0.469 overall but achieves **0% SLA compliance**.
164
+ It routes shipments efficiently enough to earn step rewards, but completely ignores
165
+ delivery deadlines. An LLM that only learns to score 0.469 has learned nothing
166
+ about the actual task.
167
+
168
+ Heuristic achieves 100% SLA but still fails at priority service (6.6%) and breaks
169
+ down under cascading disruptions. That gap — **between reactive compliance and
170
+ proactive crisis management** — is exactly what GRPO training targets.
171
+ ### What Changed After Training
172
+
173
+ The most important evidence is qualitative, not quantitative.
174
+
175
+ **Before training**, the base model given the above prompt responds with:
176
+ ```
177
+ I would route the shipment to node 4 or maybe node 6 depending on load.
178
+ ```
179
+ No JSON. No explicit reasoning about the disruption. Unusable as an action.
180
+
181
+ **After training**, the same model responds with:
182
+ ```json
183
+ {
184
+ "reasoning": "Node 2 has a weather disruption with 3 steps remaining
185
+ and is near capacity. Node 5 (Warehouse Beta) has 44% load and buffer
186
+ capacity. Routing via node 5 avoids the congestion and reduces
187
+ cascade risk to DC Coastal.",
188
+ "source_node": 2,
189
+ "dest_node": 5,
190
+ "shipment_volume": 21.5
191
+ }
192
+ ```
193
+
194
+ The model learned to reason about disruption state, infer cascade risk,
195
+ and produce the correct routing decision. That is the capability this
196
+ environment was built to train.
197
+
198
+
199
+ ### Training Progress
200
+
201
+ The reward curve below shows the model improving from the first steps after the SFT
202
+ warm-up. Because the model already knows JSON format, reward is non-zero from
203
+ step 1 and climbs steadily.
204
+
205
+ ![Reward Curve](reward_curve.png)
206
+ *GRPO training reward over 200 logging steps. After SFT warm-up,
207
+ the model starts producing valid structured actions immediately.*
208
+
209
+ ### LLM Training Evidence
210
+
211
+ The most concrete evidence of learning from GRPO training is the
212
+ **invalid action reduction on the Hard task: 24 → 7 (71% reduction)**,
213
+ which suggests the model learned the legal route topology of the network
214
+ even under cascading disruption pressure.
215
+
216
+ Overall episode score improvement is modest at this compute scale
217
+ (0.5B model, 200 GRPO steps, free T4 GPU) — this environment is
218
+ intentionally hard enough that the full capability gap requires a
219
+ larger model. The reward curve confirms a non-zero learning signal
220
+ from the first step, which is the direct result of the SFT warm-up
221
+ solving the cold-start problem.
222
+
223
+
224
+ ### Before vs After GRPO
225
+
226
+ ![Before vs After](before_after_comparison.png)
227
+ *Policy comparison across all three task difficulties.
228
+ Green = trained model. Blue = base model. Amber = heuristic baseline.*
229
+
230
+ ---
231
+
232
+
233
+
234
+ ## Live Demo
235
+
236
+ The environment is hosted as a Hugging Face Space and you can interact with it directly:
237
+
238
+ 🔗 **Space:** https://roshan5emerald-logiflow-rl.hf.space/
239
+ 📓 **Colab notebook:** https://colab.research.google.com/drive/1wGXYNNYp13emNE1ThX3aqpIM3ppcU_Ty?usp=sharing (Runtime → Run all to reproduce training)
240
+ 💻 **GitHub:** https://github.com/Roshan5105labs/crisis-logistics-env
241
+
242
+ The Space exposes a live network visualizer at `/web` where you can watch
243
+ routing decisions play out across the 12-node diagram in real time.
244
+
245
+ ---
246
+
247
+ ## Try It Yourself
248
+
249
+ The full training run reproduces in ~45 minutes on a free Colab T4 GPU.
250
+ Open the [Colab notebook](https://colab.research.google.com/drive/1wGXYNNYp13emNE1ThX3aqpIM3ppcU_Ty?usp=sharing),
251
+ hit Runtime → Run All, and watch the reward curve build live.
252
+
253
+ For local setup and API details, see the
254
+ [GitHub README](https://github.com/Roshan5105labs/crisis-logistics-env).
255
+
256
+ ---
257
+
258
+ ## Why It Matters
259
+
260
+ Supply chain disruption costs the global economy an estimated **$1.5 trillion annually**.
261
+ An LLM trained on LogiFlow-RL learns to anticipate cascade effects, reason under partial
262
+ information, and make proactive routing decisions. These are not narrow logistics skills:
263
+ they are general planning capabilities that may transfer to other domains where sequential
264
+ decisions have delayed consequences.
265
+
266
+ This environment exists to make that capability measurable, trainable, and improvable.
267
+
268
+ ---
269
+
270
+ *Built for the Meta × PyTorch × OpenEnv × Scaler Hackathon India 2026 — Theme #2: Long-Horizon Planning & Instruction Following*
271
+
272
+ *[GitHub](https://github.com/Roshan5105labs/crisis-logistics-env) · [HF Space](https://roshan5emerald-logiflow-rl.hf.space/) · [Colab](https://colab.research.google.com/drive/1wGXYNNYp13emNE1ThX3aqpIM3ppcU_Ty?usp=sharing)*