---
title: CI Triage Env
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
  - openenv
  - reinforcement-learning
  - llm
  - tool-use
  - ci-cd
  - grpo
  - qwen3
---
# CI-Triage-Env

**An OpenEnv RL environment for training LLMs to investigate ambiguous CI failures with verifiable, composable rewards.**

> Built for the *Scaler × Meta × PyTorch OpenEnv Hackathon 2026.*
> Team: Prasham Jain (lead), Sahil, Priyanshi Maheshwari.

---
## Links

| Resource | URL |
|---|---|
| 🤗 **Environment Space (judge entrypoint)** | https://huggingface.co/spaces/Prasham1710/ci-triage-env |
| 🤗 **Training Space** | https://huggingface.co/spaces/Prasham1710/ci-triage-training |
| 📦 **Scenarios dataset** | https://huggingface.co/datasets/Prasham1710/ci-triage-scenarios |
| 📦 **SFT trajectories dataset** | https://huggingface.co/datasets/Prasham1710/ci-triage-sft |
| 🧠 **SFT checkpoint (Qwen3-4B + LoRA)** | https://huggingface.co/Prasham1710/ci-triage-agent-sft |
| 📝 **Blog post (in this repo)** | [`ci-triage-blog-final.md`](ci-triage-blog-final.md), also being published as an HF Community Article |
| 📓 **Training notebook** | [notebooks/train_grpo.ipynb](notebooks/train_grpo.ipynb) |
| 💻 **Source code** | this repository |

---
## 1 · Problem: *the capability gap*

Modern CI pipelines fail for many ambiguous reasons: flaky tests, infra blips, real bugs, dependency drift, missing secrets, race conditions, test data rot. Today, an on-call engineer wastes ~30 min per failure pulling logs, checking flake history, blaming commits, and deciding whether to rerun, quarantine, or file a bug.

**Frontier LLMs are not trained on this loop.** They can read a single log dump, but they don't *investigate*: they don't choose which tool to call next, they don't trade off cost vs. information, and they reach diagnoses without supporting evidence. We could not find a public RL environment that rewards an LLM for **multi-turn, evidence-grounded triage** under a budget.

That is the gap CI-Triage-Env is built to close.
## 2 · Environment: *what the agent sees, does, and is graded on*

CI-Triage-Env is a fully OpenEnv-compliant `MCPEnvironment` with a Gym-style API (`reset` / `step` / `state`).

### Observation

Each episode begins with a **scenario**: a synthesised CI failure (~3,500 generated, 200 hand-validated for the train split) containing a `failure_summary`, the failing test, a hidden `ground_truth_root_cause`, and an oracle "minimal evidence set": the smallest set of tools whose outputs together justify the correct diagnosis.
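For illustration, a scenario record might look like the sketch below. Only the fields named above are guaranteed; the other key names (`failing_test`, `minimal_evidence_set`) are hypothetical, and the real schema lives in the scenarios dataset.

```python
# Hypothetical scenario shape; field names beyond those listed in the
# prose above are illustrative, not the dataset's actual schema.
scenario = {
    "failure_summary": "test_checkout_flow timed out after 120 s on runner gke-7",
    "failing_test": "test_checkout_flow",            # assumed field name
    "ground_truth_root_cause": "infra_blip",         # hidden from the agent
    "minimal_evidence_set": ["cluster_metrics", "query_flake_history"],
}
```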
### Action space: 11 MCP tools

| Tool | What it does | Cost |
|---|---|---|
| `read_logs` | full CI log for a test | 1 |
| `inspect_test_code` | test source | 1 |
| `run_diagnostic` | scoped shell probe | 3 |
| `cluster_metrics` | infra metrics for a window | 2 |
| `query_flake_history` | flake stats for a test | 1 |
| `recent_commits` | recent commits to repo/path | 1 |
| `check_owner` | CODEOWNERS lookup | 0 |
| `rerun_test` | quarantine probe (1×) | 5 |
| `quarantine_test` | quarantines (terminal-ish) | – |
| `file_bug` | files a bug (terminal) | – |
| `ping_owner` | message owner (terminal) | – |

The agent operates under a **total cost budget per episode**. Each step returns the tool's structured output plus the remaining budget.
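A minimal sketch of one episode against a locally running server (see "How to run" below). The response field names (`failing_test`, `remaining_budget`) are assumptions; the authoritative envelopes are the JSON schemas in the repo.

```python
# Minimal episode sketch against the /reset and /step HTTP routes.
# Response field names here are assumptions, not the validated schema.
import requests

BASE = "http://localhost:8000"

obs = requests.post(f"{BASE}/reset").json()   # initial observation
print(obs["failure_summary"])

# Cheap probe first: flake history costs only 1 budget unit.
action = {"action": {"tool": "query_flake_history",
                     "args": {"test_name": obs["failing_test"]}}}
step = requests.post(f"{BASE}/step", json=action).json()
print(step["observation"], step["remaining_budget"])
```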
### Reward: 9 composable, frozen-weight components

Built using OpenEnv's rubric pattern (composition, not a monolith):

| # | Component | Weight | Captures |
|---|---|---|---|
| 1 | `diagnosis` | 0.25 | Was the predicted root cause correct? |
| 2 | `minimal_evidence` | 0.20 | Did the agent collect the oracle evidence set? |
| 3 | `cost_efficiency` | 0.15 | Did it stay within budget? |
| 4 | `action_quality` | 0.10 | Were tool calls well-formed and contextually sensible? |
| 5 | `investigation` | 0.10 | Did the trace branch logically? |
| 6 | `format_gate` | 0.05 | Strict JSON schema compliance |
| 7 | `time_penalty` | 0.05 | Penalty for excessive turns |
| 8 | `counterfactual_predict` | 0.05 | Was the predicted "fix" plausible vs ground truth? |
| 9 | `anti_gaming` | 0.05 | Penalty for repeat / spam / quarantine-spam patterns |

**Why this is hard to game:** every component is verifiable from the trace; the highest-weight terms (`diagnosis`, `minimal_evidence`) directly require the agent to *find the right cause with the right evidence*. The `anti_gaming` and `cost_efficiency` terms specifically punish the dominant exploits (always quarantine; spam tools).

Weights live in [src/ci_triage_env/rewards/weights.py](src/ci_triage_env/rewards/weights.py); the replay verifier is in [src/ci_triage_env/rewards/replay.py](src/ci_triage_env/rewards/replay.py).
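As a sketch of the idea (the authoritative code is `composite.py` / `weights.py`), the composite is just a frozen weighted sum over per-component scores; the weights below mirror the table above and sum to 1.

```python
# Illustrative composite only: frozen weights (mirroring the table above)
# applied as a plain weighted sum over per-component scores in [0, 1].
WEIGHTS = {
    "diagnosis": 0.25, "minimal_evidence": 0.20, "cost_efficiency": 0.15,
    "action_quality": 0.10, "investigation": 0.10, "format_gate": 0.05,
    "time_penalty": 0.05, "counterfactual_predict": 0.05, "anti_gaming": 0.05,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 1

def composite_reward(components: dict) -> float:
    """Weighted sum of component scores; missing components score 0."""
    return sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
```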
## 3 · Pipeline: *training top to bottom*

```
┌───────────────────────┐
│   3,500 scenarios     │  ← clustered from real OSS CI logs
│   (synth + LLM-aug)   │    + LLM-generated edge cases
└───────────┬───────────┘
            │
┌───────────▼───────────┐
│   SFT trajectories    │  teacher: GPT-4o-mini
│   ~700 episodes       │  filtered: reward ≥ 0.6
└───────────┬───────────┘
            │
┌───────────▼────────────────────────────┐
│  Qwen3-4B + LoRA (r=16) via Unsloth    │
│  SFT: 2 epochs, bf16, A10G Small       │  ← DONE, see plot
└───────────┬────────────────────────────┘
            │
┌───────────▼────────────────────────────┐
│  GRPO · TRL · multi-turn rollout       │
│  reward = composite of 9 components    │  ← BLOCKED, see "Status"
└────────────────────────────────────────┘
```
Sources:

- SFT trainer: [src/ci_triage_env/training/sft.py](src/ci_triage_env/training/sft.py)
- GRPO trainer: [src/ci_triage_env/training/grpo.py](src/ci_triage_env/training/grpo.py)
- Multi-turn rollout: [src/ci_triage_env/training/rollout.py](src/ci_triage_env/training/rollout.py)
- Composite reward: [src/ci_triage_env/rewards/composite.py](src/ci_triage_env/rewards/composite.py)
- Notebook judges can re-run: [notebooks/train_grpo.ipynb](notebooks/train_grpo.ipynb)
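The filtering step in the diagram (keep a teacher episode only if its composite reward clears 0.6) reduces to something like the sketch below; `episodes` and its `"reward"` / `"messages"` fields are assumed names, not the dataset's actual schema.

```python
# Hedged sketch of the SFT-data gate from the pipeline diagram: keep a
# GPT-4o-mini teacher episode only if composite reward >= 0.6, then
# flatten it to chat turns. Field names are assumptions.
def build_sft_dataset(episodes, threshold=0.6):
    kept = [ep for ep in episodes if ep["reward"] >= threshold]
    return [{"messages": ep["messages"]} for ep in kept]
```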
## 4 · Results & Evidence

### SFT: completed, real run on A10G Small

We trained Qwen3-4B + LoRA via Unsloth on 718 SFT trajectories for 2 epochs. The run completed end-to-end and the checkpoint is on the HF Hub.

- **Checkpoint:** https://huggingface.co/Prasham1710/ci-triage-agent-sft
- **W&B run:** https://wandb.ai/jainprasham17-esds/ci-triage-env
- **Final training loss:** ~0.55 (from log line `180 0.548925`)
- **Hardware:** A10G Small, 24 GB VRAM, ~50 min wall-clock



*Loss curve from the real SFT run: smooth descent from ~1.4 → 0.55 over 180 steps.*
Concrete log excerpt from the run:

```
Trainable parameters = 33,030,144 of 4,055,498,240 (0.81% trained)
...
180 0.548925
[transformers] Unsloth: Restored added_tokens_decoder metadata in /data/checkpoints/sft/checkpoint-180/tokenizer_config.json.
SFT done → /data/checkpoints/sft
```
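For orientation, a hedged sketch of the configuration this run used. Values not stated above (sequence length, 4-bit loading, LoRA alpha, target modules) are illustrative assumptions; the authoritative trainer is `src/ci_triage_env/training/sft.py`.

```python
# Sketch only: Qwen3-4B + LoRA r=16 via Unsloth, 2 epochs, bf16, as
# described above. max_seq_length, load_in_4bit, lora_alpha, and
# target_modules are illustrative assumptions, not our exact settings.
from unsloth import FastLanguageModel
from trl import SFTConfig, SFTTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B",
    max_seq_length=4096,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,   # `tokenizer=` on older TRL versions
    train_dataset=sft_dataset,    # placeholder: the ~700 filtered trajectories
    args=SFTConfig(
        num_train_epochs=2,
        bf16=True,
        output_dir="/data/checkpoints/sft",
    ),
)
trainer.train()
```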
### GRPO: environment + reward + rollout all built; blocked at trainer wiring

Every component the GRPO loop needs is implemented and committed:

- TRL `GRPOTrainer` integration in [grpo.py](src/ci_triage_env/training/grpo.py)
- `MockEnvClient` for in-process rollouts (no network) in [mock_env_client.py](src/ci_triage_env/training/mock_env_client.py)
- Multi-turn `TrainingRollout` calling the same composite reward
- Frozen reward-component weights so curves are comparable

We hit a chain of upstream version conflicts: the Qwen3-5 stack required `transformers v5` from git, which then required `torchao ≥ 0.13` (itself needing `torch ≥ 2.7`), which made Unsloth's fast-LoRA matmul kernel run with mismatched fp16/bf16 tensors during the GRPO forward pass:

```
RuntimeError: self and mat2 must have the same dtype, but got Half and BFloat16
at /opt/conda/lib/python3.11/site-packages/unsloth/kernels/utils.py:1059
in matmul_lora → out.addmm_(XA, B.to(dtype), alpha=s)
```

We did get **all 9 reward components computing real values on real trajectories** during MockEnvClient testing: the reward signal is wired and shaped, just not yet connected to a policy-gradient update. The blocker is documented in detail above and reproducible from the notebook.
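For context, a hedged sketch of the intended wiring; the committed integration lives in `grpo.py`, and `rollout_and_score`, `model`, and `scenario_prompts` are placeholders for our `TrainingRollout` + composite-reward adapter, the SFT checkpoint, and a prompt dataset.

```python
# Sketch of the intended GRPO wiring, not the committed implementation.
# `rollout_and_score`, `model`, and `scenario_prompts` are placeholders.
from trl import GRPOConfig, GRPOTrainer

def ci_triage_reward(prompts, completions, **kwargs):
    # One scalar per sampled completion: replay the multi-turn trajectory
    # against MockEnvClient and score it with the frozen composite.
    return [rollout_and_score(p, c) for p, c in zip(prompts, completions)]

trainer = GRPOTrainer(
    model=model,                    # the SFT warm-start checkpoint
    reward_funcs=ci_triage_reward,
    args=GRPOConfig(output_dir="/data/checkpoints/grpo"),
    train_dataset=scenario_prompts,
)
trainer.train()                     # currently blocked by the dtype error above
```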
## 5 · How to run

### A. Use the environment (judges)

```bash
# Pull and run the env server (it's a Docker Space)
docker run -p 8000:8000 --pull always \
  registry.hf.space/prasham1710-ci-triage-env:latest

# Or clone & run locally
git clone https://huggingface.co/spaces/Prasham1710/ci-triage-env
cd ci-triage-env && docker build -t ci-triage-env . && \
  docker run -p 8000:8000 ci-triage-env
```
Then talk to it:

```bash
curl -X POST http://localhost:8000/reset   # → initial obs
curl -X POST http://localhost:8000/step \
  -H 'Content-Type: application/json' \
  -d '{"action":{"tool":"read_logs","args":{"test_name":"test_x"}}}'
curl http://localhost:8000/state
curl http://localhost:8000/docs            # OpenAPI/Swagger
```

The Space also exposes the standard MCP route at `POST /mcp` (JSON-RPC) and `WS /mcp`.
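A tool call over the JSON-RPC route might look like the following; this is a hedged sketch following the generic MCP `tools/call` convention, not this Space's verified wire format (check `/docs` for the truth).

```python
# Hedged sketch: generic MCP-style JSON-RPC "tools/call" request.
# The Space's exact /mcp payload may differ from this convention.
import requests

rpc = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "read_logs", "arguments": {"test_name": "test_x"}},
}
print(requests.post("http://localhost:8000/mcp", json=rpc).json())
```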
### B. Reproduce training

```bash
git clone https://github.com/<this-repo>.git
cd CI-Triage-Env
pip install -e ".[data,training]"
# Set HF_TOKEN, HF_USERNAME, WANDB_API_KEY
jupyter lab notebooks/train_grpo.ipynb
```

Or use the Training Space (preconfigured for A10G):
https://huggingface.co/spaces/Prasham1710/ci-triage-training
## 6 · Engineering hygiene

- ✅ OpenEnv `MCPEnvironment` base class, valid `openenv.yaml`
- ✅ Standard Gym API: `reset`, `step`, `state`
- ✅ MCP tool names: none collide with reserved names (`reset`/`step`/`state`/`close`)
- ✅ Strict client/server separation (server in `env/`, clients import only the wire schemas)
- ✅ JSON-Schema validated action/observation envelopes ([schemas/](src/ci_triage_env/schemas/)); see the sketch below
- ✅ FastAPI `/docs` for interactive exploration
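A minimal sketch of what the action envelope's schema could look like, consistent with the curl example above; the authoritative schemas are the files under `src/ci_triage_env/schemas/`.

```python
# Hedged reconstruction of the action envelope, matching the shape used
# in the curl example: {"action": {"tool": ..., "args": {...}}}.
from jsonschema import validate  # pip install jsonschema

ACTION_SCHEMA = {
    "type": "object",
    "required": ["action"],
    "properties": {
        "action": {
            "type": "object",
            "required": ["tool", "args"],
            "properties": {
                "tool": {"type": "string"},   # one of the 11 MCP tools
                "args": {"type": "object"},
            },
        },
    },
}

validate({"action": {"tool": "read_logs", "args": {"test_name": "test_x"}}},
         ACTION_SCHEMA)  # raises jsonschema.ValidationError on bad input
```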
## 7 · Why it matters

Every shop running CI burns engineering hours on triage. If a 4B-parameter LLM can do this reliably, that's hours back per on-call shift, plus a paper trail of *why* the diagnosis was made (the trace itself). The methodology (composable rubric rewards over multi-turn tool use against a budget) generalizes well beyond CI: incident response, code review triage, and security alert triage all share the same structure.
## 8 · Status (April 26, 2026, submission day)

- ✅ Environment fully implemented + deployed to HF Space
- ✅ 3,500 scenarios + 700+ SFT trajectories generated, validated, published
- ✅ All 9 reward components implemented and replay-verified
- ✅ SFT warm-start trained end-to-end (Qwen3-4B + LoRA, 2 epochs, A10G Small)
- ❌ GRPO loop blocked by Unsloth/torchao/transformers-v5 fp16/bf16 mismatch in `matmul_lora`; pipeline + reward signal verified separately
- 🚧 Inference UI (Streamlit): out of time
## 9 · License & credits

Apache-2.0. Built with OpenEnv, Unsloth, TRL, Hugging Face Transformers, and PyTorch.
Scenario data is synthetic; no proprietary CI logs are included.
## Blog

We're publishing a companion mini-blog on Hugging Face explaining the environment design, the rubric reward, and what we learned. **Link will be inserted here once published**; judges, please check the top-of-page **Links** table.