roshan5emerald committed 28 days ago

Commit

2cce423

verified ·

1 Parent(s): a4b6177

HF mini blog for the project LogiFlow RL Environment

Files changed (1)

HF_MINI_BLOG.md +272 -0

HF_MINI_BLOG.md ADDED

	@@ -0,0 +1,272 @@

+---
+title: "LogiFlow-RL: Training an LLM to Manage Supply Chain Crises Before They Cascade"
+authors:
+  - user: roshan5105labs
+---
+# LogiFlow-RL: Training an LLM to Manage Supply Chain Crises Before They Cascade
+*A long-horizon planning environment for the Meta × PyTorch × OpenEnv Hackathon India 2026*
+
+---
+## The Problem in One Sentence
+Most supply chain systems today are **reactive**: they tell you something went wrong
+only after freight is already delayed. We trained an LLM to act **proactively**, rerouting
+shipments before disruptions cascade across a 12-node global network.
+
+---
+## Why This Is Hard
+The 2021 Suez Canal blockage held up an estimated $9 billion of goods per day. The real damage
+was not the canal itself. It was that logistics managers found out about downstream effects
+**after** they had already propagated. Port backed up → warehouses overloaded →
+suppliers stalled → shelves empty. A chain reaction that took days to untangle.
+
+The fundamental question LogiFlow-RL asks is: **can an LLM learn to anticipate that chain
+before it starts?**
+
+This is hard because it requires the skills current LLMs tend to be weakest at:
+
+- **Multi-step sequential decisions** instead of one-shot answers
+- **Partial observability**: the agent sees only the nodes within 2 hops, out of a 12-node network
+- **Delayed consequences**: shipments take 2–4 steps to arrive after dispatch
+- **Stochastic disruptions** that cascade unpredictably from node to node
+---
+## The Environment
+We built a 12-node supply chain simulation using [OpenEnv](https://github.com/meta-pytorch/OpenEnv),
+structured as four tiers:
+```
+Suppliers (4) → Warehouses (3) → Distribution Centres (3) → Retail (2)
+```
+At every step the agent receives a natural language observation describing what it can
+see: **only the nodes within 2 hops of the current shipment**. Everything beyond that
+is hidden. The agent must infer the state of upstream suppliers from what flows
+downstream, much like a real regional logistics manager working from partial reports.
+
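As a rough illustration of the 2-hop visibility rule, the sketch below runs a breadth-first search over a hypothetical edge list and masks every node outside the visible set with `None`. The edge list, the node numbering, and the function names are assumptions made for this example and are not taken from the LogiFlow-RL source.

```python
from collections import deque

# Hypothetical topology: node IDs 0-3 suppliers, 4-6 warehouses,
# 7-9 distribution centres, 10-11 retail. The real edge list may differ.
EDGES = [
    (0, 4), (1, 4), (1, 5), (2, 5), (2, 6), (3, 6),  # suppliers -> warehouses
    (4, 7), (5, 8), (6, 9),                          # warehouses -> DCs
    (7, 10), (8, 10), (8, 11), (9, 11),              # DCs -> retail
]


def build_adjacency(edges):
    """Undirected adjacency map, so visibility extends upstream and downstream."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def visible_nodes(adj, start, max_hops=2):
    """Return the sorted node IDs reachable from `start` in at most `max_hops` edges."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the visibility horizon
        for nxt in adj.get(node, ()):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return sorted(hops)


def mask_loads(loads, visible):
    """Replace the load of every hidden node with None."""
    vis = set(visible)
    return [load if i in vis else None for i, load in enumerate(loads)]
```

Calling `visible_nodes(build_adjacency(EDGES), 2)` on this toy graph returns the nodes the agent would see around a shipment at node 2, and `mask_loads` hides everything else.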
+The environment has three difficulty levels:
+
+| Task | Steps | Disruption Rate | What Makes It Hard |
+|------|-------|-----------------|--------------------|
+| **Easy**: Regional Balancing | 50 | 5% | Keep loads balanced across a quiet network |
+| **Medium**: Flash Sale Surge | 70 | 9% | Absorb burst demand without warehouse spills |
+| **Hard**: Cascading Disruption | 90 | 12% | Stabilise through weather events, supplier failures, and cascade chains |
+
+What makes this a **long-horizon** problem is that a bad routing decision at step 10 may
+not show up as a penalty until step 14, when the shipment finally arrives at an already
+overloaded node and triggers a cascade. The agent must plan ahead, not just react.
+
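The delay can be made concrete with a toy in-transit queue. This is a minimal sketch, not the environment's actual code: the `Shipment` class, the capacity of 100, the starting load of 90, and the `step_arrivals` helper are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Shipment:
    dest: int
    volume: float
    remaining_steps: int  # steps left before the shipment lands


def step_arrivals(in_transit, loads, capacity=100.0):
    """Advance every shipment by one step and unload the ones that arrive.

    Returns (still_in_transit, overflow), where overflow is the volume by
    which this step's arrivals pushed destination nodes past `capacity`.
    """
    still_in_transit, overflow = [], 0.0
    for s in in_transit:
        s.remaining_steps -= 1
        if s.remaining_steps > 0:
            still_in_transit.append(s)
            continue
        before = max(loads[s.dest] - capacity, 0.0)
        loads[s.dest] += s.volume
        after = max(loads[s.dest] - capacity, 0.0)
        overflow += after - before
    return still_in_transit, overflow


# A shipment dispatched at step 10 with a 4-step transit time (assumed values).
loads = {5: 90.0}
queue = [Shipment(dest=5, volume=20.0, remaining_steps=4)]
overflow_by_step = {}
for step in range(11, 15):
    queue, overflow_by_step[step] = step_arrivals(queue, loads)
```

In this run the overflow is zero at steps 11 to 13 and only becomes non-zero at step 14, four steps after the decision that caused it, which is why a purely reactive policy sees the penalty too late.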
+---
+## What the Agent Sees and Does
+Every step, the agent receives a prompt like this:
+```
+Task: Cascading Disruption Recovery
+Step: 14/90
+Visible nodes: [2, 5, 6, 8, 9]
+Node loads: [67.3, null, 44.1, null, 51.7, null, ...]
+Active disruptions: [{"node": 2, "kind": "weather", "remaining_steps": 3}]
+In-transit: [{"dest": 5, "volume": 14.2, "remaining_steps": 2}]
+Incoming: source=2, vol=21.5
+Return JSON with: reasoning, source_node, dest_node, shipment_volume
+```
+The `null` values are hidden nodes — the agent cannot see them. It must reason from
+what it can observe. The output is a structured JSON with an explicit reasoning trace:
+```json
+{
+  "reasoning": "Port 2 is at 87% load with a weather disruption lasting 3 more steps. Warehouse Beta (node 5) has 44% load and significant buffer capacity. Routing here avoids contributing to the node 2 congestion and reduces the probability of a cascade to DC Coastal downstream.",
+  "source_node": 2,
+  "dest_node": 5,
+  "shipment_volume": 21.5
+}
+```
+The reasoning field is not cosmetic. It is part of what we evaluate, and it
+shows whether the model actually understood the situation or guessed.
+
+---
+## Reward Design
+The environment uses a **7-component grader** so that maximising any single metric cannot yield a high score on its own:
+
+| Component | Weight |
+|-----------|--------|
+| Bottleneck avoidance | 12% |
+| Network load balance | 10% |
+| Step reward quality | 10% |
+| Retail delivery volume | 32% |
+| SLA deadline compliance | 20% |
+| Disruption recovery speed | 10% |
+| Action validity | 6% |
+
+We also use 7 anti-gaming penalty channels, including overload penalties, route-loop detection,
+risk penalties for routing through disrupted nodes, and delivery-only credit
+(dispatch alone earns nothing). The reward function was validated before every training
+run with a sanity check asserting that good outputs score at least 0.65 higher
+than garbage outputs.
+
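For illustration, the weights in the table above combine into a single score as a weighted sum. The sketch below assumes each component is already normalised to [0, 1] and that the penalty channels are applied elsewhere. The dictionary keys and the `grade` function are hypothetical names, not the environment's API.

```python
# Weights copied from the table above; the key names are illustrative.
WEIGHTS = {
    "bottleneck_avoidance": 0.12,
    "load_balance": 0.10,
    "step_reward_quality": 0.10,
    "retail_delivery_volume": 0.32,
    "sla_compliance": 0.20,
    "recovery_speed": 0.10,
    "action_validity": 0.06,
}


def grade(components):
    """Weighted sum of component scores, each clamped to [0, 1]."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(
        weight * min(max(components[name], 0.0), 1.0)
        for name, weight in WEIGHTS.items()
    )
```

Under these assumptions, a policy that is perfect on every component except SLA compliance tops out at 0.80, so ignoring deadlines caps the achievable score.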
+---
+## Training
+We used a **two-phase approach** to solve the zero-reward cold-start problem.
+
+**Why two phases?** Qwen2.5-0.5B-Instruct does not reliably produce valid JSON
+from a cold start. If the model never outputs parseable JSON, every GRPO reward is 0,
+the group-relative advantages are all zero, and nothing is learned. A 20-step SFT
+warm-up on ideal routing examples teaches the output format first. GRPO then refines
+*which routing decisions are correct* within that format.
+
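To see why all-zero rewards stall training, recall that GRPO normalises each completion's reward against the other completions sampled for the same prompt. The sketch below shows that normalisation in plain Python. It is not taken from TRL's implementation, and the function name, the `eps` value, and the example rewards are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantage: (r - mean) / (std + eps) within one prompt's samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Cold start: no completion parses, so every reward is 0 and no sample is favoured.
cold = group_advantages([0.0, 0.0, 0.0, 0.0])

# After SFT warm-up: some completions parse and earn reward, so advantages differ.
warm = group_advantages([0.0, 0.6, 1.0, 0.2])
```

In the cold-start group every advantage is zero, so the policy gradient has nothing to push on. Once even a few completions earn reward, the advantages separate and the best completion is reinforced.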
+**Phase 1: SFT warm-up (20 steps)**
+
+Supervised fine-tuning on environment-derived prompt-completion pairs. This teaches the JSON
+format and basic routing structure, and takes ~5 minutes on a T4 GPU.
+
+**Phase 2: GRPO training (200 steps)**
+
+Starting from the SFT checkpoint, TRL's `GRPOTrainer` with LoRA (r=16) optimises against
+a 5-component verifiable reward: JSON validity + required keys + correct source +
+valid destination + plausible volume. This is a per-action training reward, distinct from
+the 7-component episode grader described above.
+
+```
+Stack: OpenEnv → TRL GRPOTrainer → Unsloth / QLoRA → Qwen2.5-0.5B-Instruct
+Total runtime: ~45 minutes on Colab T4 GPU
+```
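A verifiable reward of this shape can be sketched as five independent checks, together with the kind of good-versus-garbage sanity check described under Reward Design. Everything specific here is an assumption: the equal weight per check, the volume bound, and the function and argument names are illustrative, and the actual reward in the notebook may differ.

```python
import json

REQUIRED_KEYS = {"reasoning", "source_node", "dest_node", "shipment_volume"}


def verifiable_reward(completion, expected_source, valid_dests, incoming_volume):
    """Score one completion in [0, 1] as five equally weighted checks (assumed weights)."""
    try:
        # strict=False tolerates raw newlines inside string values.
        action = json.loads(completion, strict=False)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(action, dict):
        return 0.0

    passed = 1  # check 1: output is a valid JSON object
    if REQUIRED_KEYS <= action.keys():
        passed += 1  # check 2: all required keys present
    if action.get("source_node") == expected_source:
        passed += 1  # check 3: correct source node
    dest = action.get("dest_node")
    if isinstance(dest, int) and dest in valid_dests:
        passed += 1  # check 4: destination is a legal next hop
    volume = action.get("shipment_volume")
    is_number = isinstance(volume, (int, float)) and not isinstance(volume, bool)
    if is_number and 0 < volume <= incoming_volume:
        passed += 1  # check 5: volume is positive and within what is available
    return passed / 5


def sanity_check(good, garbage, min_gap=0.65, **context):
    """Assert that a known-good output beats a garbage output by at least `min_gap`."""
    gap = verifiable_reward(good, **context) - verifiable_reward(garbage, **context)
    assert gap >= min_gap, f"reward gap {gap:.2f} is below {min_gap}"
    return gap
```

A usage example under the same assumptions would call `sanity_check(good_json, "route it somewhere", expected_source=2, valid_dests={5, 6}, incoming_volume=21.5)` before starting a training run, so a broken reward function fails fast instead of silently producing a flat curve.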
+---
+## Results
+### Baseline Policies (before any LLM training)
+We evaluated three hand-coded baselines to establish the performance bar:
+| Policy | Avg Score | SLA Rate | Priority Service | Invalid Actions |
+|--------|-----------|----------|-----------------|----------------|
+| **Round-Robin** | 0.469 | **0%** | 0% | 2.0 |
+| **Heuristic** | 0.782 | **100%** | 6.6% | 3.3 |
+| **Resilient** | 0.776 | **100%** | 4.3% | 3.0 |
+
+**The critical finding:** Round-robin scores 0.469 overall but achieves **0% SLA compliance**.
+It routes shipments well enough to earn step rewards, but completely ignores
+delivery deadlines. An LLM that only learns to score 0.469 has learned little
+about the actual task.
+
+Heuristic achieves 100% SLA compliance but still fails at priority service (6.6%) and breaks
+down under cascading disruptions. That gap, **between reactive compliance and
+proactive crisis management**, is exactly what GRPO training targets.
+### What Changed After Training
+The most important evidence is qualitative, not quantitative.
+
+**Before training**, the base model given the above prompt responds with:
+```
+I would route the shipment to node 4 or maybe node 6 depending on load.
+```
+No JSON. No explicit reasoning about the disruption. Unusable as an action.
+
+**After training**, the same model responds with:
+```json
+{
+  "reasoning": "Node 2 has a weather disruption with 3 steps remaining and is near capacity. Node 5 (Warehouse Beta) has 44% load and buffer capacity. Routing via node 5 avoids the congestion and reduces cascade risk to DC Coastal.",
+  "source_node": 2,
+  "dest_node": 5,
+  "shipment_volume": 21.5
+}
+```
+In this example the trained model refers to the disruption state, weighs cascade risk,
+and produces a valid routing decision. That is the capability this
+environment was built to train.
+### Training Progress
+The reward curve below shows the model improving from the first steps after the SFT
+warm-up. Because the model already knows the JSON format, reward is non-zero from
+step 1 and climbs steadily.
+
+![Reward Curve](reward_curve.png)
+*GRPO training reward over 200 logging steps. After SFT warm-up,
+the model starts producing valid structured actions immediately.*
+### LLM Training Evidence
+The most concrete evidence of learning from GRPO training is the
+**invalid action reduction on the Hard task: 24 → 7 (71% reduction)**,
+which suggests the model learned the legal route topology of the network
+even under cascading disruption pressure.
+
+Overall episode score improvement is modest at this compute scale
+(0.5B model, 200 GRPO steps, free T4 GPU). The environment is
+intentionally hard, and closing the full capability gap likely requires a
+larger model. The reward curve shows a non-zero learning signal
+from the first step, which we attribute to the SFT warm-up
+solving the cold-start problem.
+
+### Before vs After GRPO
+![Before vs After](before_after_comparison.png)
+
+*Policy comparison across all three task difficulties.
+Green = trained model. Blue = base model. Amber = heuristic baseline.*
+
+---
+## Live Demo
+The environment is hosted as a Hugging Face Space that you can interact with directly:
+
+- 🔗 **Space:** https://roshan5emerald-logiflow-rl.hf.space/
+- 📓 **Colab notebook:** https://colab.research.google.com/drive/1wGXYNNYp13emNE1ThX3aqpIM3ppcU_Ty?usp=sharing (Runtime → Run all to reproduce training)
+- 💻 **GitHub:** https://github.com/Roshan5105labs/crisis-logistics-env
+
+The Space exposes a live network visualizer at `/web` where you can watch
+routing decisions play out across the 12-node diagram in real time.
+
+---
+## Try It Yourself
+The full training run reproduces in ~45 minutes on a free Colab T4 GPU.
+Open the [Colab notebook](https://colab.research.google.com/drive/1wGXYNNYp13emNE1ThX3aqpIM3ppcU_Ty?usp=sharing),
+select Runtime → Run all, and watch the reward curve build live.
+
+For local setup and API details, see the
+[GitHub README](https://github.com/Roshan5105labs/crisis-logistics-env).
+
+---
+## Why It Matters
+Supply chain disruption costs the global economy an estimated **$1.5 trillion annually**.
+An LLM trained on LogiFlow-RL learns to anticipate cascade effects, reason under partial
+information, and make proactive routing decisions. These are not narrow logistics skills.
+They are general planning capabilities that we expect to transfer to other domains where
+sequential decisions have delayed consequences.
+
+This environment exists to make that capability measurable, trainable, and improvable.
+
+---
+*Built for the Meta × PyTorch × OpenEnv × Scaler Hackathon India 2026 — Theme #2: Long-Horizon Planning & Instruction Following*
+*[GitHub](https://github.com/Roshan5105labs/crisis-logistics-env) · [HF Space](https://roshan5emerald-logiflow-rl.hf.space/) · [Colab](https://colab.research.google.com/drive/1wGXYNNYp13emNE1ThX3aqpIM3ppcU_Ty?usp=sharing)*