---
title: CI Triage Env
emoji: 🔍
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
pinned: false
license: apache-2.0
tags:
  - openenv
  - reinforcement-learning
  - llm
  - tool-use
  - ci-cd
  - grpo
  - qwen3
---

# CI-Triage-Env

An OpenEnv RL environment for training LLMs to investigate ambiguous CI failures with verifiable, composable rewards.

Built for the Scaler × Meta × PyTorch OpenEnv Hackathon 2026. Team: Prasham Jain (lead), Sahil, Priyanshi Maheshwari.


## Links

| Resource | URL |
| --- | --- |
| 🤗 Environment Space (judge entrypoint) | https://huggingface.co/spaces/Prasham1710/ci-triage-env |
| 🤗 Training Space | https://huggingface.co/spaces/Prasham1710/ci-triage-training |
| 📦 Scenarios dataset | https://huggingface.co/datasets/Prasham1710/ci-triage-scenarios |
| 📦 SFT trajectories dataset | https://huggingface.co/datasets/Prasham1710/ci-triage-sft |
| 🧠 SFT checkpoint (Qwen3-4B + LoRA) | https://huggingface.co/Prasham1710/ci-triage-agent-sft |
| 📝 Blog post (in this repo) | ci-triage-blog-final.md — also being published as an HF Community Article |
| 📓 Training notebook | notebooks/train_grpo.ipynb |
| 💻 Source code | this repository |

## 1 · Problem — the capability gap

Modern CI pipelines fail for many ambiguous reasons: flaky tests, infra blips, real bugs, dependency drift, missing secrets, race conditions, test data rot. Today, an on-call engineer wastes ~30 min per failure pulling logs, checking flake history, blaming commits, and deciding whether to rerun, quarantine, or file a bug.

Frontier LLMs are not trained on this loop. They can read a single log dump, but they don't investigate: they don't choose which tool to call next, they don't trade off cost vs. information, and they reach diagnoses without supporting evidence. We could not find a public RL environment that rewards an LLM for multi-turn, evidence-grounded triage under a budget.

That is the gap CI-Triage-Env is built to close.

## 2 · Environment — what the agent sees, does, and is graded on

CI-Triage-Env is a fully OpenEnv-compliant `MCPEnvironment` with a Gym-style API (`reset` / `step` / `state`).

### Observation

Each episode begins with a scenario: a synthesised CI failure (~3,500 generated, 200 hand-validated for the train split) containing a `failure_summary`, the failing test, a hidden `ground_truth_root_cause`, and an oracle "minimal evidence set" — the smallest set of tools whose outputs together justify the correct diagnosis.
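Concretely, a scenario record has roughly this shape. This is an illustrative sketch only: the field names come from the description above, but the values (and the `scenario_id` field) are invented, not taken from the published dataset.

```python
# Illustrative scenario record. Field names mirror the README text;
# all values (and the scenario_id key) are hypothetical.
scenario = {
    "scenario_id": "flaky-timeout-0042",
    "failure_summary": "test_checkout_flow timed out after 300s on shard 3",
    "failing_test": "test_checkout_flow",
    # Hidden from the agent; used only by the reward verifier:
    "ground_truth_root_cause": "flaky_test",
    # Oracle minimal evidence set: the smallest set of tools whose
    # outputs together justify the correct diagnosis.
    "minimal_evidence": ["read_logs", "query_flake_history"],
}
```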

### Action space — 11 MCP tools

| Tool | What it does | Cost |
| --- | --- | --- |
| `read_logs` | full CI log for a test | 1 |
| `inspect_test_code` | test source | 1 |
| `run_diagnostic` | scoped shell probe | 3 |
| `cluster_metrics` | infra metrics for a window | 2 |
| `query_flake_history` | flake stats for a test | 1 |
| `recent_commits` | recent commits to repo/path | 1 |
| `check_owner` | CODEOWNERS lookup | 0 |
| `rerun_test` | quarantine probe (1×) | 5 |
| `quarantine_test` | quarantines (terminal-ish) | — |
| `file_bug` | files a bug (terminal) | — |
| `ping_owner` | message owner (terminal) | — |

The agent operates under a total cost budget per episode. Each step returns the tool's structured output plus the remaining budget.
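The budget mechanics can be sketched as simple per-tool accounting. The costs below are copied from the table above; the `spend` helper and the budget of 10 are illustrative, not the environment's actual API or default.

```python
# Sketch of per-episode budget accounting using the tool costs listed above.
# The real environment tracks this server-side and returns the remaining
# budget with each step; the helper and budget value here are hypothetical.
TOOL_COSTS = {
    "read_logs": 1, "inspect_test_code": 1, "run_diagnostic": 3,
    "cluster_metrics": 2, "query_flake_history": 1, "recent_commits": 1,
    "check_owner": 0, "rerun_test": 5,
}

def spend(budget: int, tool: str) -> int:
    """Deduct a tool's cost; a negative result means the episode is over budget."""
    return budget - TOOL_COSTS[tool]

budget = 10  # hypothetical per-episode budget
for tool in ["read_logs", "query_flake_history", "run_diagnostic"]:
    budget = spend(budget, tool)
# 10 - 1 - 1 - 3 leaves 5
```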

### Reward — 9 composable, frozen-weight components

Built using OpenEnv's rubric pattern (composition, not a monolith):

| # | Component | Weight | Captures |
| --- | --- | --- | --- |
| 1 | `diagnosis` | 0.25 | Was the predicted root cause correct? |
| 2 | `minimal_evidence` | 0.20 | Did the agent collect the oracle evidence set? |
| 3 | `cost_efficiency` | 0.15 | Did it stay within budget? |
| 4 | `action_quality` | 0.10 | Were tool calls well-formed and contextually sensible? |
| 5 | `investigation` | 0.10 | Did the trace branch logically? |
| 6 | `format_gate` | 0.05 | Strict JSON schema compliance |
| 7 | `time_penalty` | 0.05 | Penalty for excessive turns |
| 8 | `counterfactual_predict` | 0.05 | Was the predicted "fix" plausible vs. ground truth? |
| 9 | `anti_gaming` | 0.05 | Penalty for repeat / spam / quarantine-spam patterns |

Why this is hard to game: every component is verifiable from the trace, and the highest-weight terms (`diagnosis`, `minimal_evidence`) directly require the agent to find the right cause with the right evidence. The `anti_gaming` and `cost_efficiency` terms specifically punish the dominant exploits (always-quarantine, tool spam).

Weights live in `src/ci_triage_env/rewards/weights.py`; the replay verifier is in `src/ci_triage_env/rewards/replay.py`.
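The composition itself is a frozen weighted sum. A minimal sketch, with the weights copied from the table above; the function name and the per-component scores are invented, and the real verifiers score penalty-style components (e.g. `time_penalty`) from the trace rather than taking them as inputs:

```python
# Frozen weights from the table above (they sum to 1.0). Component scores
# are produced by trace verifiers; the dict-of-scores interface here is a
# simplification for illustration.
WEIGHTS = {
    "diagnosis": 0.25, "minimal_evidence": 0.20, "cost_efficiency": 0.15,
    "action_quality": 0.10, "investigation": 0.10, "format_gate": 0.05,
    "time_penalty": 0.05, "counterfactual_predict": 0.05, "anti_gaming": 0.05,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

def composite_reward(scores: dict) -> float:
    """Weighted sum of the 9 component scores, each assumed in [0, 1]."""
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)

perfect = composite_reward({name: 1.0 for name in WEIGHTS})
```

Keeping the weights frozen is what makes reward curves comparable across SFT and GRPO runs.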

## 3 · Pipeline — training top to bottom

```text
                 ┌─────────────────────┐
                 │  3,500 scenarios    │   ← clustered from real OSS CI logs
                 │  (synth + LLM-aug)  │       + LLM-generated edge cases
                 └──────────┬──────────┘
                            │
                  ┌─────────▼──────────┐
                  │ SFT trajectories   │   teacher: GPT-4o-mini
                  │ ~700 episodes      │   filtered: reward ≥ 0.6
                  └─────────┬──────────┘
                            │
        ┌───────────────────▼────────────────────┐
        │  Qwen3-4B + LoRA (r=16) via Unsloth    │
        │  SFT → 2 epochs, bf16, A10G Small      │   ✅ DONE — see plot
        └───────────────────┬────────────────────┘
                            │
        ┌───────────────────▼────────────────────┐
        │  GRPO · TRL · multi-turn rollout       │
        │  reward = composite of 9 components    │   ⚠ BLOCKED — see "Status"
        └────────────────────────────────────────┘
```


## 4 · Results & Evidence

### SFT — completed, real run on A10G Small

We trained Qwen3-4B + LoRA via Unsloth on 718 SFT trajectories for 2 epochs. The run completed end-to-end and the checkpoint is on the HF Hub.

*Figure: SFT training loss — Qwen3-4B + LoRA, 2 epochs on A10G Small. Loss curve from the real SFT run: smooth descent from ~1.4 → 0.55 over 180 steps.*

Concrete log excerpt from the run:

```text
Trainable parameters = 33,030,144 of 4,055,498,240 (0.81% trained)
...
180   0.548925
[transformers] Unsloth: Restored added_tokens_decoder metadata in /data/checkpoints/sft/checkpoint-180/tokenizer_config.json.
SFT done → /data/checkpoints/sft
```

### GRPO — environment + reward + rollout all built; blocked at trainer wiring

Every component the GRPO loop needs is implemented and committed:

  • TRL GRPOTrainer integration in grpo.py
  • MockEnvClient for in-process rollouts (no network) in mock_env_client.py
  • Multi-turn TrainingRollout calling the same composite reward
  • Frozen reward-component weights so curves are comparable

We then hit a chain of upstream version conflicts: the Qwen3-5 stack required transformers v5 from git, which in turn required torchao ≥ 0.13 (and therefore torch ≥ 2.7), which left Unsloth's fast-LoRA matmul kernel running with mismatched fp16/bf16 tensors during the GRPO forward pass:

```text
RuntimeError: self and mat2 must have the same dtype, but got Half and BFloat16
  at /opt/conda/lib/python3.11/site-packages/unsloth/kernels/utils.py:1059
  in matmul_lora →  out.addmm_(XA, B.to(dtype), alpha=s)
```

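This failure mode is easy to reproduce in plain PyTorch: a matmul with one fp16 and one bf16 operand raises exactly this class of error, and casting one side first (which is what the `B.to(dtype)` call in the traceback is attempting) avoids it. A minimal sketch, unrelated to Unsloth's actual kernel internals:

```python
import torch

a = torch.randn(4, 8, dtype=torch.float16)   # Half
b = torch.randn(8, 4, dtype=torch.bfloat16)  # BFloat16

# Mixed-dtype matmul raises the same class of RuntimeError seen above:
try:
    _ = a @ b
except RuntimeError as err:
    print("matmul failed:", err)

# Casting one operand to the other's dtype avoids the mismatch:
out = a.to(b.dtype) @ b
```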
We did get all 9 reward components computing real values on real trajectories during `MockEnvClient` testing — the reward signal is wired and shaped, just not yet connected to a policy-gradient update. The blocker and the planned fix are documented in the Status section below and are reproducible from the notebook.

## 5 · How to run

### A. Use the environment (judges)

```bash
# Pull and run the env server (it's a Docker Space)
docker run -p 8000:8000 --pull always \
  registry.hf.space/prasham1710-ci-triage-env:latest

# Or clone & run locally
git clone https://huggingface.co/spaces/Prasham1710/ci-triage-env
cd ci-triage-env && docker build -t ci-triage-env . && \
  docker run -p 8000:8000 ci-triage-env
```

Then talk to it:

```bash
curl -X POST http://localhost:8000/reset                          # → initial obs
curl -X POST http://localhost:8000/step \
     -H 'Content-Type: application/json' \
     -d '{"action":{"tool":"read_logs","args":{"test_name":"test_x"}}}'
curl http://localhost:8000/state
curl http://localhost:8000/docs                                   # OpenAPI/Swagger
```

The Space also exposes the standard MCP route at `POST /mcp` (JSON-RPC) and `WS /mcp`.
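For MCP clients, a tool call over `POST /mcp` is an ordinary JSON-RPC 2.0 request. A sketch of the payload — the `tools/call` method and `name`/`arguments` params come from the MCP specification, and the tool name and args mirror the curl example above; this snippet only builds the body and does not hit the server:

```python
import json

# JSON-RPC 2.0 envelope for an MCP tool call (MCP's standard "tools/call"
# method); tool name and arguments mirror the curl /step example above.
payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read_logs",
        "arguments": {"test_name": "test_x"},
    },
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/mcp with Content-Type: application/json,
# via urllib.request or any MCP client library.
```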

### B. Reproduce training

```bash
git clone https://github.com/<this-repo>.git
cd CI-Triage-Env
pip install -e ".[data,training]"

# Set HF_TOKEN, HF_USERNAME, WANDB_API_KEY
jupyter lab notebooks/train_grpo.ipynb
```

Or use the Training Space (preconfigured for A10G): https://huggingface.co/spaces/Prasham1710/ci-triage-training

## 6 · Engineering hygiene

  • βœ… OpenEnv MCPEnvironment base class, valid openenv.yaml
  • βœ… Standard Gym API: reset, step, state
  • βœ… MCP tool names β€” none collide with reserved names (reset/step/state/close)
  • βœ… Strict client/server separation (server in env/, clients import only the wire schemas)
  • βœ… JSON-Schema validated action/observation envelopes (schemas/)
  • βœ… FastAPI /docs for interactive exploration

## 7 · Why it matters

Every shop running CI burns engineering hours on triage. If a 4B-parameter LLM can do this reliably, that's hours back per on-call shift, plus a paper trail of why the diagnosis was made (the trace itself). The methodology — composable rubric rewards over multi-turn tool use against a budget — generalizes well beyond CI: incident response, code-review triage, and security-alert triage all share the same structure.

## 8 · Status (April 26 2026, submission day)

  • βœ… Environment fully implemented + deployed to HF Space
  • βœ… 3,500 scenarios + 700+ SFT trajectories generated, validated, published
  • βœ… All 9 reward components implemented and replay-verified
  • βœ… SFT warmstart trained end-to-end (Qwen3-4B + LoRA, 2 epochs, A10G Small)
  • ⚠ GRPO loop blocked by Unsloth/torchao/transformers-v5 fp16/bf16 mismatch in matmul_lora; pipeline + reward signal verified separately
  • 🚧 Inference UI (Streamlit) β€” out of time

## 9 · License & credits

Apache-2.0. Built with OpenEnv, Unsloth, TRL, Hugging Face Transformers, PyTorch. Data scenarios are synthetic; no proprietary CI logs are included.

## Blog

We're publishing a companion mini-blog on Hugging Face explaining the environment design, the rubric reward, and what we learned. Link will be inserted here once published β€” judges, please check the top-of-page Links table.