---
title: CI Triage Env
emoji: πŸ”
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
- openenv
- reinforcement-learning
- llm
- tool-use
- ci-cd
- grpo
- qwen3
---
# CI-Triage-Env
**An OpenEnv RL environment for training LLMs to investigate ambiguous CI failures with verifiable, composable rewards.**
> Built for the *Scaler × Meta × PyTorch OpenEnv Hackathon 2026*.
> Team: Prasham Jain (lead), Sahil, Priyanshi Maheshwari.
---
## Links
| Resource | URL |
|---|---|
| 🤗 **Environment Space (judge entrypoint)** | https://huggingface.co/spaces/Prasham1710/ci-triage-env |
| 🤗 **Training Space** | https://huggingface.co/spaces/Prasham1710/ci-triage-training |
| 📦 **Scenarios dataset** | https://huggingface.co/datasets/Prasham1710/ci-triage-scenarios |
| 📦 **SFT trajectories dataset** | https://huggingface.co/datasets/Prasham1710/ci-triage-sft |
| 🧠 **SFT checkpoint (Qwen3-4B + LoRA)** | https://huggingface.co/Prasham1710/ci-triage-agent-sft |
| 📝 **Blog post (in this repo)** | [`ci-triage-blog-final.md`](ci-triage-blog-final.md) — also being published as an HF Community Article |
| 📓 **Training notebook** | [notebooks/train_grpo.ipynb](notebooks/train_grpo.ipynb) |
| 💻 **Source code** | this repository |
---
## 1 · Problem — *the capability gap*
Modern CI pipelines fail for many ambiguous reasons: flaky tests, infra blips, real bugs, dependency drift, missing secrets, race conditions, test data rot. Today, an on-call engineer wastes ~30 min per failure pulling logs, checking flake history, blaming commits, and deciding whether to rerun, quarantine, or file a bug.
**Frontier LLMs are not trained on this loop.** They can read a single log dump, but they don't *investigate*: they don't choose which tool to call next, they don't trade off cost vs. information, and they reach diagnoses without supporting evidence. We could not find a public RL environment that rewards an LLM for **multi-turn, evidence-grounded triage** under a budget.
That is the gap CI-Triage-Env is built to close.
## 2 · Environment — *what the agent sees, does, and is graded on*
CI-Triage-Env is a fully OpenEnv-compliant `MCPEnvironment` with a Gym-style API (`reset` / `step` / `state`).
### Observation
Each episode begins with a **scenario**: a synthesised CI failure (~3,500 generated, 200 hand-validated for the train split) containing a `failure_summary`, the failing test, a hidden `ground_truth_root_cause`, and an oracle "minimal evidence set" — the smallest set of tools whose outputs together justify the correct diagnosis.
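A scenario record might look like the sketch below. The field names follow this README; the IDs and values are illustrative, and the authoritative schema lives in the scenarios dataset:

```python
# Illustrative scenario record -- field names follow the README; the exact
# schema lives in the repo's dataset, so treat this shape as a sketch.
scenario = {
    "scenario_id": "flaky-timeout-0042",  # hypothetical ID
    "failure_summary": "test_checkout_flow timed out after 120s on runner ci-7",
    "failing_test": "tests/e2e/test_checkout_flow.py::test_checkout_flow",
    # Hidden from the agent; used only for grading:
    "ground_truth_root_cause": "flaky_test",
    # Oracle minimal evidence set: the smallest set of tool calls whose
    # outputs together justify the correct diagnosis.
    "minimal_evidence_set": ["query_flake_history", "read_logs"],
}

def agent_visible_view(s: dict) -> dict:
    """Strip the hidden grading fields before showing the scenario to the agent."""
    hidden = {"ground_truth_root_cause", "minimal_evidence_set"}
    return {k: v for k, v in s.items() if k not in hidden}
```

The hidden fields are what make the rewards verifiable: the agent never sees them, but the replay verifier does.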
### Action space — 11 MCP tools
| Tool | What it does | Cost |
|---|---|---|
| `read_logs` | full CI log for a test | 1 |
| `inspect_test_code` | test source | 1 |
| `run_diagnostic` | scoped shell probe | 3 |
| `cluster_metrics` | infra metrics for a window | 2 |
| `query_flake_history` | flake stats for a test | 1 |
| `recent_commits` | recent commits to repo/path | 1 |
| `check_owner` | CODEOWNERS lookup | 0 |
| `rerun_test` | quarantine probe (1×) | 5 |
| `quarantine_test` | quarantines (terminal-ish) | — |
| `file_bug` | files a bug (terminal) | — |
| `ping_owner` | message owner (terminal) | — |
The agent operates under a **total cost budget per episode**. Each step returns the tool's structured output plus the remaining budget.
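The budget bookkeeping can be sketched as follows. The per-tool costs mirror the table above; the default budget of 12 is an assumption for illustration, not the environment's actual value:

```python
# Sketch of per-episode cost-budget accounting. Tool costs mirror the table
# above; the total budget of 12 is illustrative, not the env's real value.
TOOL_COSTS = {
    "read_logs": 1, "inspect_test_code": 1, "run_diagnostic": 3,
    "cluster_metrics": 2, "query_flake_history": 1, "recent_commits": 1,
    "check_owner": 0, "rerun_test": 5,
}

class Budget:
    def __init__(self, total: int = 12):
        self.remaining = total

    def charge(self, tool: str) -> int:
        """Deduct the tool's cost; terminal tools (file_bug, ...) are free here."""
        cost = TOOL_COSTS.get(tool, 0)
        if cost > self.remaining:
            raise RuntimeError(f"budget exhausted: {tool} costs {cost}")
        self.remaining -= cost
        return self.remaining
```

Returning the remaining budget from every charge matches what each `step` observation reports back to the agent.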
### Reward — 9 composable, frozen-weight components
Built using OpenEnv's rubric pattern (composition, not a monolith):
| # | Component | Weight | Captures |
|---|---|---|---|
| 1 | `diagnosis` | 0.25 | Was the predicted root cause correct? |
| 2 | `minimal_evidence` | 0.20 | Did the agent collect the oracle evidence set? |
| 3 | `cost_efficiency` | 0.15 | Did it stay within budget? |
| 4 | `action_quality` | 0.10 | Were tool calls well-formed and contextually sensible? |
| 5 | `investigation` | 0.10 | Did the trace branch logically? |
| 6 | `format_gate` | 0.05 | Strict JSON schema compliance |
| 7 | `time_penalty` | 0.05 | Penalty for excessive turns |
| 8 | `counterfactual_predict` | 0.05 | Was the predicted "fix" plausible vs ground truth? |
| 9 | `anti_gaming` | 0.05 | Penalty for repeat / spam / quarantine-spam patterns |
**Why this is hard to game:** every component is verifiable from the trace; the highest-weight terms (`diagnosis`, `minimal_evidence`) directly require the agent to *find the right cause with the right evidence*. The `anti_gaming` and `cost_efficiency` terms specifically punish the dominant exploit (always-quarantine; spam tools).
Weights live in [src/ci_triage_env/rewards/weights.py](src/ci_triage_env/rewards/weights.py); replay verifier is in [src/ci_triage_env/rewards/replay.py](src/ci_triage_env/rewards/replay.py).
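The composition itself is just a weighted sum over per-component scores in [0, 1]. A minimal sketch using the nine weights from the table (the real implementation lives in `composite.py`):

```python
# The nine frozen weights from the table above; they sum to 1.0.
WEIGHTS = {
    "diagnosis": 0.25, "minimal_evidence": 0.20, "cost_efficiency": 0.15,
    "action_quality": 0.10, "investigation": 0.10, "format_gate": 0.05,
    "time_penalty": 0.05, "counterfactual_predict": 0.05, "anti_gaming": 0.05,
}

def composite_reward(components: dict) -> float:
    """Weighted sum of per-component scores, each expected in [0, 1].

    Missing components score 0, so a trace that games one term still
    forfeits the weight of everything it skipped.
    """
    return sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
```

Freezing the weights is what makes reward curves comparable across SFT and GRPO runs.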
## 3 · Pipeline — *training top to bottom*
```
┌─────────────────────┐
│  3,500 scenarios    │   ← clustered from real OSS CI logs
│  (synth + LLM-aug)  │     + LLM-generated edge cases
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│  SFT trajectories   │   teacher: GPT-4o-mini
│  ~700 episodes      │   filtered: reward ≥ 0.6
└──────────┬──────────┘
           │
┌──────────▼────────────────────────────┐
│  Qwen3-4B + LoRA (r=16) via Unsloth   │
│  SFT → 2 epochs, bf16, A10G Small     │   ✅ DONE — see plot
└──────────┬────────────────────────────┘
           │
┌──────────▼────────────────────────────┐
│  GRPO · TRL · multi-turn rollout      │
│  reward = composite of 9 components   │   ⚠ BLOCKED — see "Status"
└───────────────────────────────────────┘
```
Sources:
- SFT trainer β€” [src/ci_triage_env/training/sft.py](src/ci_triage_env/training/sft.py)
- GRPO trainer β€” [src/ci_triage_env/training/grpo.py](src/ci_triage_env/training/grpo.py)
- Multi-turn rollout β€” [src/ci_triage_env/training/rollout.py](src/ci_triage_env/training/rollout.py)
- Composite reward β€” [src/ci_triage_env/rewards/composite.py](src/ci_triage_env/rewards/composite.py)
- Notebook judges can re-run β€” [notebooks/train_grpo.ipynb](notebooks/train_grpo.ipynb)
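The "filtered: reward ≥ 0.6" gate in the diagram can be sketched as a one-pass filter over replayed teacher episodes (the episode dicts here are toy stand-ins for full multi-turn traces):

```python
# Sketch of the SFT filtering step: replay each teacher episode through the
# composite reward, then keep only those clearing the threshold.
def filter_sft_trajectories(trajectories, threshold=0.6):
    """Keep episodes whose replayed composite reward clears the threshold."""
    return [t for t in trajectories if t["reward"] >= threshold]

episodes = [  # toy episodes; real ones carry full multi-turn traces
    {"id": "ep-1", "reward": 0.82},
    {"id": "ep-2", "reward": 0.41},  # dropped: below 0.6
    {"id": "ep-3", "reward": 0.60},  # kept: the threshold is inclusive
]
kept = filter_sft_trajectories(episodes)
```

Filtering on the *same* composite reward used later by GRPO keeps the SFT warmstart on-distribution for the RL objective.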
## 4 · Results & Evidence
### SFT — completed, real run on A10G Small
We trained Qwen3-4B + LoRA via Unsloth on 718 SFT trajectories for 2 epochs. The run completed end-to-end and the checkpoint is on the HF Hub.
- **Checkpoint:** https://huggingface.co/Prasham1710/ci-triage-agent-sft
- **W&B run:** https://wandb.ai/jainprasham17-esds/ci-triage-env
- **Final training loss:** ~0.55 (from log line `180 0.548925`)
- **Hardware:** A10G Small, 24 GB VRAM, ~50 min wall-clock
![SFT training loss — Qwen3-4B + LoRA, 2 epochs on A10G Small](loss_graphs_sft.png)
*Loss curve from the real SFT run — smooth descent from ~1.4 → 0.55 over 180 steps.*
Concrete log excerpt from the run:
```
Trainable parameters = 33,030,144 of 4,055,498,240 (0.81% trained)
...
180 0.548925
[transformers] Unsloth: Restored added_tokens_decoder metadata in /data/checkpoints/sft/checkpoint-180/tokenizer_config.json.
SFT done → /data/checkpoints/sft
```
### GRPO — environment + reward + rollout all built; blocked at trainer wiring
Every component the GRPO loop needs is implemented and committed:
- TRL `GRPOTrainer` integration in [grpo.py](src/ci_triage_env/training/grpo.py)
- `MockEnvClient` for in-process rollouts (no network) in [mock_env_client.py](src/ci_triage_env/training/mock_env_client.py)
- Multi-turn `TrainingRollout` calling the same composite reward
- Frozen reward-component weights so curves are comparable
We hit a chain of upstream version conflicts: the Qwen3 stack required `transformers` v5 from git, which in turn required `torchao ≥ 0.13` (which needed `torch ≥ 2.7`), and that combination made Unsloth's fast-LoRA matmul kernel run with mismatched fp16/bf16 tensors during the GRPO forward pass:
```
RuntimeError: self and mat2 must have the same dtype, but got Half and BFloat16
at /opt/conda/lib/python3.11/site-packages/unsloth/kernels/utils.py:1059
in matmul_lora → out.addmm_(XA, B.to(dtype), alpha=s)
```
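One conventional way out of a dependency chain like this is to pin the stack below the breaking releases and rebuild. The pins below are an assumption sketched from the conflict described above, not what this repo currently ships:

```
# requirements-pin.txt (illustrative -- adjust to the versions your run needs)
transformers<5     # stay off the v5 git branch that pulled in the torchao bump
torchao<0.13
torch<2.7
```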
We did get **all 9 reward components computing real values on real trajectories** during MockEnvClient testing — the reward signal is wired and shaped, just not yet connected to a policy-gradient update. The blocker is documented in the Status section below and is reproducible from the notebook.
## 5 · How to run
### A. Use the environment (judges)
```bash
# Pull and run the env server (it's a Docker Space)
docker run -p 8000:8000 --pull always \
registry.hf.space/prasham1710-ci-triage-env:latest
# Or clone & run locally
git clone https://huggingface.co/spaces/Prasham1710/ci-triage-env
cd ci-triage-env && docker build -t ci-triage-env . && \
docker run -p 8000:8000 ci-triage-env
```
Then talk to it:
```bash
curl -X POST http://localhost:8000/reset # → initial obs
curl -X POST http://localhost:8000/step \
-H 'Content-Type: application/json' \
-d '{"action":{"tool":"read_logs","args":{"test_name":"test_x"}}}'
curl http://localhost:8000/state
curl http://localhost:8000/docs # OpenAPI/Swagger
```
The Space also exposes the standard MCP route at `POST /mcp` (JSON-RPC) and `WS /mcp`.
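A `tools/call` request to `POST /mcp` can be built as a standard JSON-RPC 2.0 envelope. The method name follows the MCP spec; the exact `params` layout the server expects should be confirmed against `/docs`:

```python
import json

# MCP-style JSON-RPC 2.0 envelope for calling the read_logs tool.
# "tools/call" is the standard MCP method; the params layout here is the
# generic MCP shape, so confirm the server's exact schema via /docs.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "read_logs", "arguments": {"test_name": "test_x"}},
}
payload = json.dumps(request)  # POST this body to /mcp, or send it over WS /mcp
```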
### B. Reproduce training
```bash
git clone https://github.com/<this-repo>.git
cd CI-Triage-Env
pip install -e ".[data,training]"
# Set HF_TOKEN, HF_USERNAME, WANDB_API_KEY
jupyter lab notebooks/train_grpo.ipynb
```
Or use the Training Space (preconfigured for A10G):
https://huggingface.co/spaces/Prasham1710/ci-triage-training
## 6 · Engineering hygiene
- ✅ OpenEnv `MCPEnvironment` base class, valid `openenv.yaml`
- ✅ Standard Gym API: `reset`, `step`, `state`
- ✅ MCP tool names — none collide with reserved names (`reset`/`step`/`state`/`close`)
- ✅ Strict client/server separation (server in `env/`, clients import only the wire schemas)
- ✅ JSON-Schema validated action/observation envelopes ([schemas/](src/ci_triage_env/schemas/))
- ✅ FastAPI `/docs` for interactive exploration
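In the spirit of the validated envelopes, a minimal hand-rolled structural check looks like this (the real validation uses the JSON-Schema files in `schemas/`; the field names follow the `curl` example in section 5):

```python
def validate_action(envelope: dict) -> bool:
    """Minimal structural check for {"action": {"tool": ..., "args": {...}}}.

    A sketch only -- the repo validates against full JSON-Schema files, which
    also constrain tool names and per-tool argument shapes.
    """
    action = envelope.get("action")
    return (
        isinstance(action, dict)
        and isinstance(action.get("tool"), str)
        and isinstance(action.get("args"), dict)
    )
```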
## 7 · Why it matters
Every shop running CI burns engineering hours on triage. If a 4B-parameter LLM can do this reliably, that's hours back per on-call shift, plus a paper trail of *why* each diagnosis was made (the trace itself). The methodology — composable rubric rewards over multi-turn tool use under a budget — generalizes well beyond CI: incident response, code-review triage, and security-alert triage all share the same structure.
## 8 · Status (April 26, 2026, submission day)
- ✅ Environment fully implemented + deployed to HF Space
- ✅ 3,500 scenarios + 700+ SFT trajectories generated, validated, published
- ✅ All 9 reward components implemented and replay-verified
- ✅ SFT warmstart trained end-to-end (Qwen3-4B + LoRA, 2 epochs, A10G Small)
- ⚠ GRPO loop blocked by Unsloth/torchao/transformers-v5 fp16/bf16 mismatch in `matmul_lora`; pipeline + reward signal verified separately
- 🚧 Inference UI (Streamlit) — out of time
## 9 · License & credits
Apache-2.0. Built with OpenEnv, Unsloth, TRL, Hugging Face Transformers, PyTorch.
Data scenarios are synthetic; no proprietary CI logs are included.
## Blog
We're publishing a companion mini-blog on Hugging Face explaining the environment design, the rubric reward, and what we learned. **Link will be inserted here once published** — judges, please check the top-of-page **Links** table.