---
title: CI Triage Env
emoji: 🔍
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
  - openenv
  - reinforcement-learning
  - llm
  - tool-use
  - ci-cd
  - grpo
  - qwen3
---
# CI-Triage-Env
An OpenEnv RL environment for training LLMs to investigate ambiguous CI failures with verifiable, composable rewards.
Built for the Scaler × Meta × PyTorch OpenEnv Hackathon 2026. Team: Prasham Jain (lead), Sahil, Priyanshi Maheshwari.
## Links
| Resource | URL |
|---|---|
| 🤗 Environment Space (judge entrypoint) | https://huggingface.co/spaces/Prasham1710/ci-triage-env |
| 🤗 Training Space | https://huggingface.co/spaces/Prasham1710/ci-triage-training |
| 📦 Scenarios dataset | https://huggingface.co/datasets/Prasham1710/ci-triage-scenarios |
| 📦 SFT trajectories dataset | https://huggingface.co/datasets/Prasham1710/ci-triage-sft |
| 🧠 SFT checkpoint (Qwen3-4B + LoRA) | https://huggingface.co/Prasham1710/ci-triage-agent-sft |
| 📝 Blog post (in this repo) | `ci-triage-blog-final.md` — also being published as an HF Community Article |
| 📓 Training notebook | `notebooks/train_grpo.ipynb` |
| 💻 Source code | this repository |
## 1 · Problem — the capability gap
Modern CI pipelines fail for many ambiguous reasons: flaky tests, infra blips, real bugs, dependency drift, missing secrets, race conditions, test data rot. Today, an on-call engineer wastes ~30 min per failure pulling logs, checking flake history, blaming commits, and deciding whether to rerun, quarantine, or file a bug.
Frontier LLMs are not trained on this loop. They can read a single log dump, but they don't investigate: they don't choose which tool to call next, they don't trade off cost vs. information, and they reach diagnoses without supporting evidence. We could not find a public RL environment that rewards an LLM for multi-turn, evidence-grounded triage under a budget.
That is the gap CI-Triage-Env is built to close.
## 2 · Environment — what the agent sees, does, and is graded on

CI-Triage-Env is a fully OpenEnv-compliant `MCPEnvironment` with a Gym-style API (`reset` / `step` / `state`).

### Observation
Each episode begins with a scenario: a synthesised CI failure (~3,500 generated, 200 hand-validated for the train split) containing a failure_summary, the failing test, a hidden ground_truth_root_cause, and an oracle "minimal evidence set" — the smallest set of tools whose outputs together justify the correct diagnosis.
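To make the split between hidden and visible fields concrete, here is an illustrative scenario record. Only the field names mentioned above come from the dataset description; the example values and any other structure are assumptions, not the actual schema.

```python
# Illustrative scenario record (NOT the actual dataset schema).
# Only failure_summary, the failing test, ground_truth_root_cause, and
# the minimal evidence set are documented fields; values are invented.
scenario = {
    "failure_summary": "test_checkout_flow timed out after 300s on runner ci-linux-42",
    "failing_test": "tests/integration/test_checkout_flow.py::test_checkout_flow",
    "ground_truth_root_cause": "infra_blip",               # hidden from the agent
    "minimal_evidence": ["read_logs", "cluster_metrics"],  # oracle evidence set
}

# The agent only ever observes the non-hidden fields:
HIDDEN = ("ground_truth_root_cause", "minimal_evidence")
observation = {k: v for k, v in scenario.items() if k not in HIDDEN}
```

The hidden fields are used only by the reward components at grading time.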
### Action space — 11 MCP tools

| Tool | What it does | Cost |
|---|---|---|
| `read_logs` | full CI log for a test | 1 |
| `inspect_test_code` | test source | 1 |
| `run_diagnostic` | scoped shell probe | 3 |
| `cluster_metrics` | infra metrics for a window | 2 |
| `query_flake_history` | flake stats for a test | 1 |
| `recent_commits` | recent commits to repo/path | 1 |
| `check_owner` | CODEOWNERS lookup | 0 |
| `rerun_test` | quarantine probe (1×) | 5 |
| `quarantine_test` | quarantines (terminal-ish) | — |
| `file_bug` | files a bug (terminal) | — |
| `ping_owner` | message owner (terminal) | — |
The agent operates under a total cost budget per episode. Each step returns the tool's structured output plus the remaining budget.
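The budget mechanics can be sketched as follows. The costs come from the table above; the function name and the exact end-of-episode policy (what happens when a tool would exceed the budget) are our assumptions, not the environment's actual API.

```python
# Tool costs from the table above; terminal actions carry no step cost here.
TOOL_COSTS = {
    "read_logs": 1, "inspect_test_code": 1, "run_diagnostic": 3,
    "cluster_metrics": 2, "query_flake_history": 1, "recent_commits": 1,
    "check_owner": 0, "rerun_test": 5,
}
TERMINAL = {"quarantine_test", "file_bug", "ping_owner"}

def apply_action(budget: int, tool: str) -> tuple[int, bool]:
    """Deduct the tool's cost and report whether the episode ends.

    Terminal tools end the episode; a tool that would exceed the
    remaining budget also ends it (illustrative policy, not confirmed).
    """
    if tool in TERMINAL:
        return budget, True
    cost = TOOL_COSTS[tool]
    if cost > budget:
        return budget, True
    return budget - cost, False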
### Reward — 9 composable, frozen-weight components
Built using OpenEnv's rubric pattern (composition, not a monolith):
| # | Component | Weight | Captures |
|---|---|---|---|
| 1 | `diagnosis` | 0.25 | Was the predicted root cause correct? |
| 2 | `minimal_evidence` | 0.20 | Did the agent collect the oracle evidence set? |
| 3 | `cost_efficiency` | 0.15 | Did it stay within budget? |
| 4 | `action_quality` | 0.10 | Were tool calls well-formed and contextually sensible? |
| 5 | `investigation` | 0.10 | Did the trace branch logically? |
| 6 | `format_gate` | 0.05 | Strict JSON schema compliance |
| 7 | `time_penalty` | 0.05 | Penalty for excessive turns |
| 8 | `counterfactual_predict` | 0.05 | Was the predicted "fix" plausible vs. ground truth? |
| 9 | `anti_gaming` | 0.05 | Penalty for repeat / spam / quarantine-spam patterns |
Why this is hard to game: every component is verifiable from the trace, and the highest-weight terms (`diagnosis`, `minimal_evidence`) directly require the agent to find the right cause with the right evidence. The `anti_gaming` and `cost_efficiency` terms specifically punish the dominant exploits (always quarantining, spamming tools).
Weights live in `src/ci_triage_env/rewards/weights.py`; the replay verifier is in `src/ci_triage_env/rewards/replay.py`.
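The composite is a plain weighted sum over the component scores. A minimal sketch using the weights from the table above (the actual implementation lives in `composite.py` and `weights.py`; the function name here is ours):

```python
# Frozen weights from the rubric table above; they sum to 1.0.
WEIGHTS = {
    "diagnosis": 0.25, "minimal_evidence": 0.20, "cost_efficiency": 0.15,
    "action_quality": 0.10, "investigation": 0.10, "format_gate": 0.05,
    "time_penalty": 0.05, "counterfactual_predict": 0.05, "anti_gaming": 0.05,
}

def composite_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each assumed in [0, 1]."""
    return sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
```

Freezing the weights is what makes training curves comparable across runs: a change in the composite can only come from the policy, never from a re-tuned rubric.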
## 3 · Pipeline — training top to bottom
```
┌─────────────────────┐
│  3,500 scenarios    │ ← clustered from real OSS CI logs
│  (synth + LLM-aug)  │   + LLM-generated edge cases
└──────────┬──────────┘
           │
┌──────────▼──────────┐
│  SFT trajectories   │ ← teacher: GPT-4o-mini
│   ~700 episodes     │   filtered: reward ≥ 0.6
└──────────┬──────────┘
           │
┌──────────▼──────────────────────────┐
│ Qwen3-4B + LoRA (r=16) via Unsloth  │
│ SFT · 2 epochs, bf16, A10G Small    │ ← DONE, see plot
└──────────┬──────────────────────────┘
           │
┌──────────▼──────────────────────────┐
│ GRPO · TRL · multi-turn rollout     │
│ reward = composite of 9 components  │ ← BLOCKED, see "Status"
└─────────────────────────────────────┘
```
Sources:
- SFT trainer → `src/ci_triage_env/training/sft.py`
- GRPO trainer → `src/ci_triage_env/training/grpo.py`
- Multi-turn rollout → `src/ci_triage_env/training/rollout.py`
- Composite reward → `src/ci_triage_env/rewards/composite.py`
- Notebook judges can re-run → `notebooks/train_grpo.ipynb`
## 4 · Results & Evidence

### SFT — completed, real run on A10G Small
We trained Qwen3-4B + LoRA via Unsloth on 718 SFT trajectories for 2 epochs. The run completed end-to-end and the checkpoint is on the HF Hub.
- Checkpoint: https://huggingface.co/Prasham1710/ci-triage-agent-sft
- W&B run: https://wandb.ai/jainprasham17-esds/ci-triage-env
- Final training loss: ~0.55 (from log line `180  0.548925`)
- Hardware: A10G Small, 24 GB VRAM, ~50 min wall-clock

Loss curve from the real SFT run: smooth descent from ~1.4 → ~0.55 over 180 steps.
Concrete log excerpt from the run:
```
Trainable parameters = 33,030,144 of 4,055,498,240 (0.81% trained)
...
180 0.548925
[transformers] Unsloth: Restored added_tokens_decoder metadata in /data/checkpoints/sft/checkpoint-180/tokenizer_config.json.
SFT done → /data/checkpoints/sft
```
### GRPO — environment + reward + rollout all built; blocked at trainer wiring
Every component the GRPO loop needs is implemented and committed:
- TRL `GRPOTrainer` integration in `grpo.py`
- `MockEnvClient` for in-process rollouts (no network) in `mock_env_client.py`
- Multi-turn `TrainingRollout` calling the same composite reward
- Frozen reward-component weights so curves are comparable
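The in-process rollout idea can be sketched with a toy stand-in (this is not the actual `MockEnvClient` or `TrainingRollout` API; class and function names, costs, and the scripted policy are all illustrative):

```python
# Toy in-process environment client: no network, deterministic scripted
# episode. Illustrates the shape of a multi-turn rollout only; the real
# MockEnvClient in mock_env_client.py differs in detail.
class ToyEnvClient:
    def __init__(self, budget: int = 10):
        self.budget = budget
        self.trace = []

    def reset(self) -> dict:
        self.trace = []
        return {"failure_summary": "test_x timed out", "budget": self.budget}

    def step(self, action: dict) -> tuple[dict, bool]:
        cost = {"read_logs": 1, "file_bug": 0}.get(action["tool"], 1)
        self.budget -= cost
        self.trace.append(action["tool"])
        done = action["tool"] == "file_bug"  # terminal action ends the episode
        return {"tool_output": "...", "budget": self.budget}, done

def rollout(client, policy) -> list:
    """Run one multi-turn episode and return the tool-call trace."""
    obs, done = client.reset(), False
    while not done:
        obs, done = client.step(policy(obs))
    return client.trace

# A scripted "policy": read logs once, then file a bug.
def scripted_policy(obs: dict) -> dict:
    return {"tool": "file_bug"} if obs.get("tool_output") else {"tool": "read_logs"}
```

In the real loop, `policy` is the LoRA-tuned model and the finished trace is scored by the same composite reward used everywhere else.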
We hit a chain of upstream version conflicts: the Qwen3 stack required transformers v5 from git, which required torchao ≥ 0.13 (itself needing torch ≥ 2.7), which left Unsloth's fast-LoRA matmul kernel running with mismatched fp16/bf16 tensors during the GRPO forward pass:
```
RuntimeError: self and mat2 must have the same dtype, but got Half and BFloat16
  at /opt/conda/lib/python3.11/site-packages/unsloth/kernels/utils.py:1059
  in matmul_lora → out.addmm_(XA, B.to(dtype), alpha=s)
```
We did get all 9 reward components computing real values on real trajectories during MockEnvClient testing — the reward signal is wired and shaped, just not yet connected to a policy-gradient update. The blocker and our plan are documented in detail above and reproducible from the notebook.
## 5 · How to run

### A. Use the environment (judges)
```bash
# Pull and run the env server (it's a Docker Space)
docker run -p 8000:8000 --pull always \
  registry.hf.space/prasham1710-ci-triage-env:latest

# Or clone & run locally
git clone https://huggingface.co/spaces/Prasham1710/ci-triage-env
cd ci-triage-env && docker build -t ci-triage-env . && \
  docker run -p 8000:8000 ci-triage-env
```
Then talk to it:
```bash
curl -X POST http://localhost:8000/reset   # → initial obs
curl -X POST http://localhost:8000/step \
  -H 'Content-Type: application/json' \
  -d '{"action":{"tool":"read_logs","args":{"test_name":"test_x"}}}'
curl http://localhost:8000/state
curl http://localhost:8000/docs            # OpenAPI/Swagger
```
The Space also exposes the standard MCP route at `POST /mcp` (JSON-RPC) and `WS /mcp`.
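For the MCP route, a tool invocation is a JSON-RPC 2.0 `tools/call` request. A sketch of the envelope (the argument names inside `arguments` mirror the curl example above but are otherwise assumptions):

```python
import json

# JSON-RPC 2.0 envelope for an MCP tools/call request. The "arguments"
# payload is illustrative; consult /docs for the actual tool schemas.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read_logs",
        "arguments": {"test_name": "test_x"},
    },
}
body = json.dumps(request)  # POST this to /mcp
```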
### B. Reproduce training
```bash
git clone https://github.com/<this-repo>.git
cd CI-Triage-Env
pip install -e ".[data,training]"

# Set HF_TOKEN, HF_USERNAME, WANDB_API_KEY
jupyter lab notebooks/train_grpo.ipynb
```
Or use the Training Space (preconfigured for A10G): https://huggingface.co/spaces/Prasham1710/ci-triage-training
## 6 · Engineering hygiene
- ✅ OpenEnv `MCPEnvironment` base class, valid `openenv.yaml`
- ✅ Standard Gym API: `reset`, `step`, `state`
- ✅ MCP tool names — none collide with reserved names (`reset`/`step`/`state`/`close`)
- ✅ Strict client/server separation (server in `env/`, clients import only the wire schemas)
- ✅ JSON-Schema validated action/observation envelopes (`schemas/`)
- ✅ FastAPI `/docs` for interactive exploration
## 7 · Why it matters
Every shop running CI burns engineering hours on triage. If a 4B-parameter LLM can do this reliably, that's hours back per on-call shift, plus a paper trail of why each diagnosis was made (the trace itself). The methodology — composable rubric rewards over multi-turn tool use against a budget — generalizes well beyond CI: incident response, code-review triage, and security-alert triage all share the same structure.
## 8 · Status (April 26, 2026, submission day)
- ✅ Environment fully implemented + deployed to HF Space
- ✅ 3,500 scenarios + 700+ SFT trajectories generated, validated, published
- ✅ All 9 reward components implemented and replay-verified
- ✅ SFT warmstart trained end-to-end (Qwen3-4B + LoRA, 2 epochs, A10G Small)
- ❌ GRPO loop blocked by Unsloth/torchao/transformers-v5 fp16/bf16 mismatch in `matmul_lora`; pipeline + reward signal verified separately
- 🚧 Inference UI (Streamlit) — out of time
## 9 · License & credits
Apache-2.0. Built with OpenEnv, Unsloth, TRL, Hugging Face Transformers, PyTorch. Data scenarios are synthetic; no proprietary CI logs are included.
## Blog
We're publishing a companion mini-blog on Hugging Face explaining the environment design, the rubric reward, and what we learned. The link will be inserted here once published — judges, please check the top-of-page Links table.
