logiflow-rl / README.md
roshan5emerald's picture
Upload folder using huggingface_hub
07d1261 verified
---
title: LogiFlow-RL
emoji: "⭐"
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
base_path: /web
---
# LogiFlow-RL — Smart Supply Chain Crisis Management
> **Training an LLM to route shipments proactively across a 12-node global supply chain — before disruptions cascade, not after.**
[![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-blue)](https://github.com/meta-pytorch/OpenEnv)
[![HuggingFace Space](https://img.shields.io/badge/🤗%20Space-Live-green)](https://huggingface.co/spaces/roshan5emerald/logiflow-rl)
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wGXYNNYp13emNE1ThX3aqpIM3ppcU_Ty?usp=sharing)
[![Theme](https://img.shields.io/badge/Theme-Long--Horizon%20Planning-purple)](https://github.com/meta-pytorch/OpenEnv)
---
## The Problem
The 2021 Suez Canal blockage held up **$9 billion of goods every day**. Every major retailer missed
delivery SLAs that quarter. The root cause was not the blockage itself — it was that routing systems
identified the disruption **after** the cascade had already propagated: port backed up → upstream
warehouses overloaded → supplier shipments stalled → retail shelves empty.
Modern logistics software is fundamentally **reactive**. It alerts managers when things have already
gone wrong. What the industry needs is an agent that can read early warning signals — rising node
loads, congestion trends, disruption probability — and reroute **before** the cascade.
This is what LogiFlow-RL trains.
---
## What This Project Does
LogiFlow-RL is an **OpenEnv-compliant reinforcement learning environment** that simulates a
12-node global supply chain operating under stochastic disruptions. An LLM agent is trained via
GRPO to act as a proactive logistics crisis manager: observing partial network state, reasoning
about disruption trajectories, and routing shipments to prevent overloads before they cascade.
```
Suppliers → Warehouses → Distribution Centres → Retail Sinks
4 → 3 → 3 → 2
```
The environment is **genuinely hard to solve** — a round-robin baseline scores only 0.469 average
and achieves **0% SLA compliance**, because it cannot see far enough ahead to prioritise
time-sensitive shipments. Even a well-designed heuristic struggles on cascade scenarios.
---
## The Capability Gap Being Targeted
| What LLMs are good at today | What this environment trains |
|---|---|
| One-shot Q&A and summarisation | Multi-step sequential decisions |
| Full information, short context | Partial observability, long horizon |
| Static prompts | Dynamic world state that changes every step |
| Reactive reasoning | Anticipatory planning under uncertainty |
> **Research framing:** Could a researcher write a paper about training on this environment?
> Yes — it targets long-horizon planning under partial observability with delayed reward signals,
> a recognised capability gap in current LLM architectures.
---
## Environment Architecture
### Network Topology (12 nodes, 4 tiers)
```
[Node 0] Supplier North ──┐
[Node 1] Supplier West ──┼──► [Node 4] Warehouse Alpha ──► [Node 7] DC Metro ──► [Node 10] Retail North
[Node 2] Supplier Port ──┼──► [Node 5] Warehouse Beta ──► [Node 8] DC Central ──► ↕
[Node 3] Supplier Inland──┘──► [Node 6] Warehouse Gamma ──► [Node 9] DC Coastal ──► [Node 11] Retail South
```
Every node has: **capacity**, **current load**, **drain rate**, **risk score**, and
**typed connections** to downstream nodes. Freight takes **2–4 steps** to transit
between nodes — the agent must plan ahead, not just react.
### What Makes This Hard
**1. Partial observability.** The agent sees only nodes within 2 hops of the current
shipment source. Nodes beyond that radius appear as `null` in the observation. The agent
must infer hidden network state from what flows downstream — exactly like a real logistics
manager working from regional reports, not a global dashboard.
**2. Stochastic cascade disruptions.** Disruptions trigger probabilistically based on each
node's `risk_score` and the episode's `disruption_rate`. When a node is disrupted, connected
downstream nodes have a `cascade_rate` chance of also disrupting within 2 steps. These
cascades cannot be predicted or memorised — they require genuine situational reasoning.
**3. Priority-demand windows.** Certain shipments carry SLA deadlines and preferred retail
destinations. Missing a priority window is penalised proportionally to how late the delivery
arrives. The agent must balance general throughput against time-sensitive commitments.
**4. Dynamic pressure feedback.** The environment tracks a `dynamic_pressure` scalar that
combines overload ratio, SLA gap, and active disruptions. This pressure feeds back into
disruption probability and effective shipment volumes — creating a self-reinforcing difficulty
that rewards proactive management.
### Three Difficulty Tiers
| Task | Title | Steps | Disruption Rate | Cascade Rate | Objective |
|------|-------|-------|-----------------|--------------|-----------|
| **Easy** | Regional Network Balancing | 50 | 0.05 | 0.10 | Keep utilisation balanced while moving freight to retail within SLA |
| **Medium** | Flash Sale With Port Risk | 70 | 0.09 | 0.16 | Recover from burst demand and port slowdowns; prevent warehouse spillovers |
| **Hard** | Cascading Disruption Recovery | 90 | 0.12 | 0.22 | Stabilise a partially observable chain through weather events, supplier failures, and cascade disruptions |
### Action Space
At each step the agent receives a natural language observation and must output:
```json
{
"reasoning": "Port 2 is trending toward congestion. Warehouse Beta has 33% buffer capacity. Routing via Beta avoids the likely cascade.",
"source_node": 2,
"dest_node": 5,
"shipment_volume": 18.5
}
```
The `reasoning` field is not just cosmetic — it is **required** by the reward function and
is what judges and users actually see when demonstrating the trained model.
### Observation Space
```python
CrisisLogisticsObservation(
step_count = 14,
max_steps = 90,
visible_node_ids = [2, 5, 6, 8, 9], # 2-hop visibility only
observed_node_loads = [67.3, None, 44.1, None, 51.7, None, ...], # null = hidden
node_capacities = [90.0, None, 125.0, ...],
active_disruptions = [{"node": 2, "kind": "weather", "remaining_steps": 3}],
in_transit_shipments= [{"dest": 5, "volume": 14.2, "remaining_steps": 2}],
pending_source_node = 2,
incoming_load = 21.5,
dynamic_pressure = 0.38,
cumulative_score = 0.61,
last_reward = 0.72,
)
```
---
## Reward Design
The environment uses a **7-component weighted grader** to prevent reward hacking and
ensure every aspect of logistics performance is measured independently.
### Episode Grader (`graders.py`)
| Component | Weight | What It Measures |
|-----------|--------|-----------------|
| Bottleneck avoidance | 12% | How often any node exceeded capacity |
| Network balance | 10% | Average load-gap between most and least loaded nodes |
| Step reward | 10% | Average per-step reward across the episode |
| Retail delivery | 32% | Freight actually delivered to retail nodes vs target |
| SLA compliance | 20% | Deliveries arriving within their deadline window |
| Disruption recovery | 10% | How quickly the network stabilised after each disruption |
| Action validity | 6% | Fraction of legal (connected) routing decisions |
### Training Reward (`action_reward` in `train_grpo.py`)
The GRPO training reward is a 5-component verifiable reward:
| Component | Max | What It Checks |
|-----------|-----|---------------|
| Valid JSON | 0.20 | Output is parseable JSON |
| Required keys | 0.20 | All 4 fields present: reasoning, source, dest, volume |
| Correct source node | 0.20 | source_node matches the episode's current shipment |
| Connected destination | 0.25 | dest_node is a legal neighbour of source_node |
| Plausible volume | 0.15 | 0 < shipment_volume ≤ 60 and close to incoming load |
### Anti-Gaming Guards
- Reward only counts on **confirmed delivery**, not on dispatch
- **Route-repeat penalty** for consecutive identical routing decisions
- **Risk penalty** for routing through actively disrupted nodes
- **Overload penalty** applied even if JSON format is perfect
- All reward components are **independent** — gaming one does not inflate others
---
## Training
### Method: SFT Warm-Up → GRPO
Training uses a two-phase approach:
**Phase 1 — SFT Warm-Up (20 steps)**
Qwen2.5-0.5B-Instruct does not reliably output valid JSON from a cold start. A brief supervised
fine-tuning step on ideal routing examples teaches the model the output format. Without this,
GRPO sees reward = 0 for most early generations and cannot learn.
**Phase 2 — GRPO (200 steps)**
Starting from the SFT checkpoint, GRPO optimises the model against the verifiable reward function.
The model generates 4 completions per prompt; GRPO compares them within the group and pushes the
model toward higher-scoring routing decisions.
### Training Stack
```
OpenEnv environment → live rollout prompts → TRL GRPOTrainer
+ Unsloth (QLoRA r=16)
+ Qwen2.5-0.5B-Instruct
```
| Parameter | Value |
|-----------|-------|
| Base model | `Qwen/Qwen2.5-0.5B-Instruct` |
| Adapter | LoRA r=16, α=32 |
| Optimiser | GRPO via TRL |
| Max steps | 200 |
| Generations per prompt | 4 |
| Learning rate | 5e-6 |
| GPU | T4 (Colab free tier) |
| Total training time | ~45 minutes |
---
## Results
### Baseline Policy Comparison
The table below shows three hand-coded baselines evaluated on all three tasks
**before any LLM training**. These are the targets the trained model must beat.
| Policy | Avg Score | Avg SLA Rate | Avg Priority Service | Avg Invalid Actions |
|--------|-----------|-------------|---------------------|---------------------|
| **Round-Robin** | 0.469 | 0.0% | 0.0% | 2.0 |
| **Heuristic** | 0.782 | 100.0% | 6.6% | 3.3 |
| **Resilient** | 0.776 | 100.0% | 4.3% | 3.0 |
**Key insight:** Round-robin achieves 0% SLA success rate despite reasonable step rewards —
because it ignores delivery deadlines entirely. Heuristic achieves 100% SLA but still
fails on priority service (6.6%) and produces invalid actions under disruption.
The trained GRPO model targets both gaps.
### Per-Task Breakdown
| Task | Round-Robin | Heuristic | Resilient |
|------|------------|-----------|-----------|
| Easy | 0.473 | 0.768 | 0.761 |
| Medium | 0.472 | 0.763 | 0.752 |
| Hard | 0.461 | 0.814 | 0.814 |
### Training Evidence
The reward curve below shows GRPO training progress. After the SFT warm-up,
the model starts producing valid JSON immediately and reward climbs from the first steps.
![Reward Curve](image-1.png)
*Figure 1: GRPO training reward over 200 logging steps
![Before vs After](artifacts/before_after_comparison.png)
*Figure 2: Policy comparison across all three task difficulties.
![Metrics Panel](artifacts/metrics_panel.png)
*Figure 3: Detailed metrics breakdown — overall score, SLA rate, retail delivered, invalid
actions, and bottlenecks — for all three policies across all three tasks.*
![Training Loss](artifacts/Training_loss.png)
---
Training was run on Colab free-tier T4 GPU with Qwen2.5-0.5B-Instruct.
The most concrete evidence of learning is the **invalid action reduction
on Hard difficulty: 24 → 7 (71% reduction)**, confirming the model
learned the legal route topology of the network.
Overall episode score improvement is modest at this model scale —
this environment is intentionally hard enough that meaningful capability
gains require a 7B+ model with 500+ GRPO steps.
## What the Trained Agent Thinks
Below is an example of the trained Qwen2.5-0.5B model reasoning through a hard-task
disruption scenario at step 14. This is the chain-of-thought the model produces before
taking an action:
```
Situation: Port 2 is at 87% load with an active weather disruption (3 steps remaining).
Warehouse Beta has 44% load and 33% buffer capacity. 21.5 units incoming from Supplier Port.
Model output:
{
"reasoning": "Supplier Port (node 2) is experiencing a weather disruption with 3 steps
remaining and is near capacity at 87%. Routing through node 5 (Warehouse Beta) which
has significant buffer at 44% capacity and is not disrupted. This avoids contributing
to the congestion at node 2 and reduces cascade risk to downstream DC Coastal.",
"source_node": 2,
"dest_node": 5,
"shipment_volume": 21.5
}
```
The heuristic would route to the nearest available node. The trained model routes to the
node that minimises cascade probability — a fundamentally different reasoning pattern.
---
## Running Locally
### Start the environment server
```bash
git clone https://github.com/Roshan5105labs/crisis-logistics-env.git
cd crisis-logistics-env/crisis_logistics_env
pip install -e .
uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Test the environment (no LLM required)
```python
from crisis_logistics_env.server.crisis_logistics_env_environment import (
CrisisLogisticsEnvironment, choose_network_action
)
env = CrisisLogisticsEnvironment()
obs = env.reset(task_id="hard")
while not obs.done:
obs = env.step(choose_network_action(obs))
print(f"Score: {env.score:.3f}")
```
### Run the trained LLM agent
```bash
# Set your HuggingFace token for Qwen-72B inference
export HF_TOKEN=your_token_here
python inference.py
```
### Reproduce training
```bash
python train_grpo.py \
--model-name "Qwen/Qwen2.5-0.5B-Instruct" \
--max-steps 200 \
--output-dir "outputs/logiflow-grpo-script"
```
Or open the Colab notebook for a one-click reproducible run:
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1wGXYNNYp13emNE1ThX3aqpIM3ppcU_Ty?usp=sharing)
---
## API Endpoints
The environment is served as a FastAPI application and is fully OpenEnv-compliant.
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Returns `{"status": "healthy"}` — judges use this to verify the Space is live |
| `/reset` | POST | Start a new episode. Body: `{"task_id": "easy"}` |
| `/step` | POST | Take one action. Body: `{"action": {"source_node": 2, "dest_node": 5, "shipment_volume": 18.5}}` |
| `/state` | GET | Full internal state (all 12 nodes visible, no partial observability) |
| `/schema` | GET | OpenAPI schema |
| `/web` | GET | Live network visualizer dashboard |
---
## Project Structure
```
crisis_logistics_env/
├── models.py # Action, Observation, State dataclasses
├── tasks.py # Task configs (easy / medium / hard)
├── graders.py # 7-component episode grader (0.0–1.0)
├── train_grpo.py # Production GRPO training script
├── inference.py # LLM agent loop (Qwen-72B via HF router)
├── train_and_evaluate.py # Baseline policy evaluation
├── gym_env.py # gymnasium.Env wrapper
├── client.py # HTTP client for server
├── server/
│ ├── app.py # FastAPI server (7 endpoints)
│ └── crisis_logistics_env_environment.py # World simulation engine
├── visualisation/
│ └── logiflow_visualizer.html # Live dashboard (served at /web)
├── notebooks/
│ └── logiflow_grpo_colab.ipynb # Reproducible training notebook
├── artifacts/
│ ├── benchmark_summary.json # Baseline policy results
│ ├── reward_curve.png # GRPO training curve
│ ├── before_after_comparison.png # Policy comparison chart
│ └── metrics_panel.png # Detailed metrics breakdown
├── openenv.yaml # OpenEnv manifest
└── Dockerfile # HuggingFace Space deployment
```
---
## Links
| Resource | Link |
|----------|------|
| 🤗 HuggingFace Space (live environment) | https://huggingface.co/spaces/roshan5emerald/logiflow-rl
| 📓 Colab Training Notebook | https://colab.research.google.com/drive/1wGXYNNYp13emNE1ThX3aqpIM3ppcU_Ty?usp=sharing |
| 📝 HuggingFace Blog Post | https://huggingface.co/spaces/roshan5emerald/logiflow-rl/blob/main/HF_MINI_BLOG.md |
---
## Why This Matters
Supply chain disruption costs the global economy an estimated **$1.5 trillion annually**.
The gap is not infrastructure — it is decision-making speed and anticipatory reasoning.
An LLM trained on LogiFlow-RL learns to:
- Read congestion signals before they become bottlenecks
- Reason about partial information the way a real logistics manager would
- Anticipate cascade effects from disruptions it cannot directly observe
- Balance competing priorities: throughput, SLA compliance, and network stability
This environment exists to teach LLMs something they currently cannot do well — and to
prove that teaching is measurable.
---
## Citation
```bibtex
@misc{logiflow-rl-2026,
title = {LogiFlow-RL: Training LLMs for Proactive Supply Chain Crisis Management},
author = {S. Roshan Pranao},
year = {2026},
howpublished = {OpenEnv Hackathon India 2026 — Theme \#2: Long-Horizon Planning},
url = {https://huggingface.co/spaces/roshan5emerald/logiflow-rl}
}
```
---
*Submitted to the Meta × PyTorch × OpenEnv × Scaler Hackathon India 2026 — Theme #2: Long-Horizon Planning & Instruction Following*