---
title: CI Triage Env
emoji: 🔍
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
  - openenv
  - reinforcement-learning
  - llm
  - tool-use
  - ci-cd
  - grpo
  - qwen3
---

# CI-Triage-Env

**An OpenEnv RL environment for training LLMs to investigate ambiguous CI failures with verifiable, composable rewards.**

> Built for the *Scaler × Meta × PyTorch OpenEnv Hackathon 2026*.
> Team: Prasham Jain (lead), Sahil, Priyanshi Maheshwari.

---

## Links

| Resource | URL |
|---|---|
| 🤗 **Environment Space (judge entrypoint)** | https://huggingface.co/spaces/Prasham1710/ci-triage-env |
| 🤗 **Training Space** | https://huggingface.co/spaces/Prasham1710/ci-triage-training |
| 📦 **Scenarios dataset** | https://huggingface.co/datasets/Prasham1710/ci-triage-scenarios |
| 📦 **SFT trajectories dataset** | https://huggingface.co/datasets/Prasham1710/ci-triage-sft |
| 🧠 **SFT checkpoint (Qwen3-4B + LoRA)** | https://huggingface.co/Prasham1710/ci-triage-agent-sft |
| 📝 **Blog post (in this repo)** | [`ci-triage-blog-final.md`](ci-triage-blog-final.md) — also being published as an HF Community Article |
| 📓 **Training notebook** | [notebooks/train_grpo.ipynb](notebooks/train_grpo.ipynb) |
| 💻 **Source code** | this repository |

---

## 1 · Problem — *the capability gap*

Modern CI pipelines fail for many ambiguous reasons: flaky tests, infra blips, real bugs, dependency drift, missing secrets, race conditions, test data rot. Today, an on-call engineer wastes ~30 minutes per failure pulling logs, checking flake history, blaming commits, and deciding whether to rerun, quarantine, or file a bug.

**Frontier LLMs are not trained on this loop.** They can read a single log dump, but they don't *investigate*: they don't choose which tool to call next, they don't trade off cost vs.
information, and they reach diagnoses without supporting evidence. We could not find a public RL environment that rewards an LLM for **multi-turn, evidence-grounded triage** under a budget. That is the gap CI-Triage-Env is built to close.

## 2 · Environment — *what the agent sees, does, and is graded on*

CI-Triage-Env is a fully OpenEnv-compliant `MCPEnvironment` with a Gym-style API (`reset` / `step` / `state`).

### Observation

Each episode begins with a **scenario**: a synthesised CI failure (~3,500 generated, 200 hand-validated for the train split) containing a `failure_summary`, the failing test, a hidden `ground_truth_root_cause`, and an oracle "minimal evidence set" — the smallest set of tools whose outputs together justify the correct diagnosis.

### Action space — 11 MCP tools

| Tool | What it does | Cost |
|---|---|---|
| `read_logs` | full CI log for a test | 1 |
| `inspect_test_code` | test source | 1 |
| `run_diagnostic` | scoped shell probe | 3 |
| `cluster_metrics` | infra metrics for a window | 2 |
| `query_flake_history` | flake stats for a test | 1 |
| `recent_commits` | recent commits to repo/path | 1 |
| `check_owner` | CODEOWNERS lookup | 0 |
| `rerun_test` | quarantine probe (1×) | 5 |
| `quarantine_test` | quarantines (terminal-ish) | — |
| `file_bug` | files a bug (terminal) | — |
| `ping_owner` | message owner (terminal) | — |

The agent operates under a **total cost budget per episode**. Each step returns the tool's structured output plus the remaining budget.

### Reward — 9 composable, frozen-weight components

Built using OpenEnv's rubric pattern (composition, not a monolith):

| # | Component | Weight | Captures |
|---|---|---|---|
| 1 | `diagnosis` | 0.25 | Was the predicted root cause correct? |
| 2 | `minimal_evidence` | 0.20 | Did the agent collect the oracle evidence set? |
| 3 | `cost_efficiency` | 0.15 | Did it stay within budget? |
| 4 | `action_quality` | 0.10 | Were tool calls well-formed and contextually sensible? |
| 5 | `investigation` | 0.10 | Did the trace branch logically? |
| 6 | `format_gate` | 0.05 | Strict JSON schema compliance |
| 7 | `time_penalty` | 0.05 | Penalty for excessive turns |
| 8 | `counterfactual_predict` | 0.05 | Was the predicted "fix" plausible vs. ground truth? |
| 9 | `anti_gaming` | 0.05 | Penalty for repeat / spam / quarantine-spam patterns |

**Why this is hard to game:** every component is verifiable from the trace; the highest-weight terms (`diagnosis`, `minimal_evidence`) directly require the agent to *find the right cause with the right evidence*. The `anti_gaming` and `cost_efficiency` terms specifically punish the dominant exploits (always-quarantine; tool spam). Weights live in [src/ci_triage_env/rewards/weights.py](src/ci_triage_env/rewards/weights.py); the replay verifier is in [src/ci_triage_env/rewards/replay.py](src/ci_triage_env/rewards/replay.py).

## 3 · Pipeline — *training top to bottom*

```
┌──────────────────────┐
│   3,500 scenarios    │  ← clustered from real OSS CI logs
│  (synth + LLM-aug)   │    + LLM-generated edge cases
└──────────┬───────────┘
           │
┌──────────▼───────────┐
│  SFT trajectories    │  teacher: GPT-4o-mini
│   ~700 episodes      │  filtered: reward ≥ 0.6
└──────────┬───────────┘
           │
┌──────────▼──────────────────────────┐
│ Qwen3-4B + LoRA (r=16) via Unsloth  │
│ SFT → 2 epochs, bf16, A10G Small    │  ✅ DONE — see plot
└──────────┬──────────────────────────┘
           │
┌──────────▼──────────────────────────┐
│ GRPO · TRL · multi-turn rollout     │
│ reward = composite of 9 components  │  ⚠ BLOCKED — see "Status"
└─────────────────────────────────────┘
```

Sources:

- SFT trainer — [src/ci_triage_env/training/sft.py](src/ci_triage_env/training/sft.py)
- GRPO trainer — [src/ci_triage_env/training/grpo.py](src/ci_triage_env/training/grpo.py)
- Multi-turn rollout — [src/ci_triage_env/training/rollout.py](src/ci_triage_env/training/rollout.py)
- Composite reward — [src/ci_triage_env/rewards/composite.py](src/ci_triage_env/rewards/composite.py)
- Notebook judges can re-run — [notebooks/train_grpo.ipynb](notebooks/train_grpo.ipynb)

## 4 · Results & Evidence

### SFT — completed, real run on A10G Small

We trained Qwen3-4B + LoRA via Unsloth on 718 SFT trajectories for 2 epochs. The run completed end-to-end and the checkpoint is on the HF Hub.

- **Checkpoint:** https://huggingface.co/Prasham1710/ci-triage-agent-sft
- **W&B run:** https://wandb.ai/jainprasham17-esds/ci-triage-env
- **Final training loss:** ~0.55 (from log line `180 0.548925`)
- **Hardware:** A10G Small, 24 GB VRAM, ~50 min wall-clock

![SFT training loss — Qwen3-4B + LoRA, 2 epochs on A10G Small](loss_graphs_sft.png)

*Loss curve from the real SFT run — a smooth descent from ~1.4 to ~0.55 over 180 steps.*

Concrete log excerpt from the run:

```
Trainable parameters = 33,030,144 of 4,055,498,240 (0.81% trained)
...
180 0.548925
[transformers] Unsloth: Restored added_tokens_decoder metadata in /data/checkpoints/sft/checkpoint-180/tokenizer_config.json.
SFT done → /data/checkpoints/sft
```

### GRPO — environment + reward + rollout all built; blocked at trainer wiring

Every component the GRPO loop needs is implemented and committed:

- TRL `GRPOTrainer` integration in [grpo.py](src/ci_triage_env/training/grpo.py)
- `MockEnvClient` for in-process rollouts (no network) in [mock_env_client.py](src/ci_triage_env/training/mock_env_client.py)
- Multi-turn `TrainingRollout` calling the same composite reward
- Frozen reward-component weights so curves are comparable

We hit a chain of upstream version conflicts after the Qwen3-5 stack required `transformers v5` from git, which in turn required `torchao ≥ 0.13` (which needed `torch ≥ 2.7`), which made Unsloth's fast-LoRA matmul kernel run with mismatched fp16/bf16 tensors during the GRPO forward pass:

```
RuntimeError: self and mat2 must have the same dtype, but got Half and BFloat16
  at /opt/conda/lib/python3.11/site-packages/unsloth/kernels/utils.py:1059
  in matmul_lora → out.addmm_(XA, B.to(dtype), alpha=s)
```

We did get **all 9 reward components computing real values on real trajectories** during `MockEnvClient` testing — the reward signal is wired and shaped, just not yet connected to a policy-gradient update. The blocker is documented in detail above, and the pipeline is reproducible from the notebook.

## 5 · How to run

### A. Use the environment (judges)

```bash
# Pull and run the env server (it's a Docker Space)
docker run -p 8000:8000 --pull always \
  registry.hf.space/prasham1710-ci-triage-env:latest

# Or clone & run locally
git clone https://huggingface.co/spaces/Prasham1710/ci-triage-env
cd ci-triage-env && docker build -t ci-triage-env .
docker run -p 8000:8000 ci-triage-env
```

Then talk to it:

```bash
curl -X POST http://localhost:8000/reset    # → initial obs
curl -X POST http://localhost:8000/step \
  -H 'Content-Type: application/json' \
  -d '{"action":{"tool":"read_logs","args":{"test_name":"test_x"}}}'
curl http://localhost:8000/state
curl http://localhost:8000/docs             # OpenAPI/Swagger
```

The Space also exposes the standard MCP route at `POST /mcp` (JSON-RPC) and `WS /mcp`.

### B. Reproduce training

```bash
git clone https://github.com/.git
cd CI-Triage-Env
pip install -e ".[data,training]"
# Set HF_TOKEN, HF_USERNAME, WANDB_API_KEY
jupyter lab notebooks/train_grpo.ipynb
```

Or use the Training Space (preconfigured for A10G): https://huggingface.co/spaces/Prasham1710/ci-triage-training

## 6 · Engineering hygiene

- ✅ OpenEnv `MCPEnvironment` base class, valid `openenv.yaml`
- ✅ Standard Gym API: `reset`, `step`, `state`
- ✅ MCP tool names — none collide with reserved names (`reset` / `step` / `state` / `close`)
- ✅ Strict client/server separation (server in `env/`, clients import only the wire schemas)
- ✅ JSON-Schema-validated action/observation envelopes ([schemas/](src/ci_triage_env/schemas/))
- ✅ FastAPI `/docs` for interactive exploration

## 7 · Why it matters

Every shop running CI burns engineering hours on triage. If a 4B-parameter LLM can do this reliably, that's hours back per on-call shift, plus a paper trail of *why* the diagnosis was made (the trace itself). The methodology — composable rubric rewards over multi-turn tool use against a budget — generalizes well beyond CI: incident response, code-review triage, and security-alert triage all share the same structure.
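To make the rubric pattern concrete, here is a minimal sketch of frozen-weight composition. The component names and weights mirror the reward table in §2, but the `Trajectory` type and scorer signature are illustrative assumptions, not the repo's actual API (the real code is in `rewards/composite.py` and `rewards/weights.py`):

```python
# Sketch of frozen-weight rubric composition (illustrative, not the repo's API).
from typing import Callable, Dict

Trajectory = dict  # a replayable episode trace (tool calls + final diagnosis)

# Frozen weights, as listed in the reward table (they sum to 1.0).
WEIGHTS: Dict[str, float] = {
    "diagnosis": 0.25,
    "minimal_evidence": 0.20,
    "cost_efficiency": 0.15,
    "action_quality": 0.10,
    "investigation": 0.10,
    "format_gate": 0.05,
    "time_penalty": 0.05,
    "counterfactual_predict": 0.05,
    "anti_gaming": 0.05,
}

def composite_reward(
    traj: Trajectory,
    components: Dict[str, Callable[[Trajectory], float]],
) -> float:
    """Weighted sum of independently verifiable component scores in [0, 1]."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"unscored components: {missing}")
    return sum(w * components[name](traj) for name, w in WEIGHTS.items())
```

Because each component scores the trace independently, any one of them can be ablated or re-weighted without touching the others, which is what keeps the reward replayable and comparable across runs.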
## 8 · Status (April 26 2026, submission day)

- ✅ Environment fully implemented + deployed to HF Space
- ✅ 3,500 scenarios + 700+ SFT trajectories generated, validated, published
- ✅ All 9 reward components implemented and replay-verified
- ✅ SFT warmstart trained end-to-end (Qwen3-4B + LoRA, 2 epochs, A10G Small)
- ⚠ GRPO loop blocked by Unsloth/torchao/transformers-v5 fp16/bf16 mismatch in `matmul_lora`; pipeline + reward signal verified separately
- 🚧 Inference UI (Streamlit) — out of time

## 9 · License & credits

Apache-2.0. Built with OpenEnv, Unsloth, TRL, Hugging Face Transformers, and PyTorch. Data scenarios are synthetic; no proprietary CI logs are included.

## Blog

We're publishing a companion mini-blog on Hugging Face explaining the environment design, the rubric reward, and what we learned. **Link will be inserted here once published** — judges, please check the top-of-page **Links** table.
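## Appendix · Python client sketch

The curl calls in §5.A translate directly to a few lines of stdlib Python. This is a sketch, not a shipped client: the endpoint paths and action envelope follow the curl examples, while the helper names (`make_action`, `post`) are ours.

```python
# Minimal stdlib client for the env server's reset/step/state routes.
# Assumes the server from §5.A is running on localhost:8000.
import json
from typing import Optional
from urllib import request

BASE = "http://localhost:8000"

def make_action(tool: str, **args) -> dict:
    """Build the JSON action envelope expected by POST /step."""
    return {"action": {"tool": tool, "args": args}}

def post(path: str, payload: Optional[dict] = None) -> dict:
    """POST JSON to the env server and decode the JSON response."""
    data = json.dumps(payload or {}).encode()
    req = request.Request(
        BASE + path, data=data, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# With the server running:
#   obs  = post("/reset")                                       # initial observation
#   step = post("/step", make_action("read_logs", test_name="test_x"))
```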