ChaosOps AI: GRPO LoRA Adapter

LoRA adapter for Qwen 2.5-1.5B-Instruct, fine-tuned with GRPO (Group Relative Policy Optimization, via TRL) on the ChaosOps AI multi-agent incident-response environment.

What ChaosOps trains

Four LLM agents (SRE, Developer, Manager, Oversight) handle production-incident scenarios (DB deadlock, memory leak, bad config push, autoscaler cost-cut by a rogue AI, misrouted traffic by a rogue load balancer, cascade failure, DNS outage, disk full, rogue deploy bot) under partial observability. The Oversight agent is rewarded for detecting that another AI in the fleet caused the incident before the team applies a fix.
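A minimal sketch of the rogue-catch scoring described above, assuming the payoffs listed in the recipe (+100 for a correct catch, -150 for a false positive) are the final per-episode values and that raising no flag scores 0. The function name is illustrative, not the ChaosOps API.

```python
def oversight_score(flagged_rogue: bool, rogue_present: bool) -> float:
    """Score one episode from the Oversight agent's perspective."""
    if flagged_rogue:
        # correct catch vs. false positive on a clean fleet
        return 100.0 if rogue_present else -150.0
    return 0.0  # no flag raised
```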

Training recipe

Algorithm                TRL GRPO
Base model               Qwen/Qwen2.5-1.5B-Instruct
LoRA target modules      q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
LoRA rank / alpha        32 / 32
Group size               2 completions per prompt
Optimization steps       600 (curriculum: 200 EASY → 200 MEDIUM → 200 HARD)
Learning rate            2e-5 (4× the TRL default; a previous flat-LR run plateaued)
Sampling temperature     0.8
Rogue rubric multiplier  2.0 (catch +100, false positive -150)
Max prompt length        1024 tokens
Max completion length    96 tokens
Hardware                 NVIDIA L4 (24 GB) via HF Jobs
Reward streams           team (0.6) + oversight (0.4)
Final KL to base         0.595 (sustained, controlled movement)
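The recipe above can be sketched with TRL and PEFT configuration objects. Argument names follow trl.GRPOConfig and peft.LoraConfig; treat the exact values as a transcription of the table and anything omitted (curriculum scheduling, reward wiring) as outside this fragment, not the run's actual training script.

```python
from peft import LoraConfig
from trl import GRPOConfig

# LoRA over all attention and MLP projections, rank 32 / alpha 32
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

grpo_cfg = GRPOConfig(
    learning_rate=2e-5,        # 4x the TRL default
    num_generations=2,         # group size: completions per prompt
    temperature=0.8,
    max_prompt_length=1024,
    max_completion_length=96,
    max_steps=600,             # curriculum tiers handled outside the trainer
)
```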

The reward is composed of four named OpenEnv-style rubrics (resolution, mttr, oversight, cascade); see the ChaosOps source.
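An illustrative blending of the two reward streams from the four named rubrics. The 0.6 / 0.4 stream weights come from the recipe table; which rubrics feed the team stream is an assumption here, and the ChaosOps source remains authoritative.

```python
def blended_reward(rubrics):
    """Combine rubric scores into one scalar reward.

    rubrics: dict with keys "resolution", "mttr", "cascade", "oversight".
    Assumes the team stream aggregates the first three rubrics and the
    oversight stream is the oversight rubric alone.
    """
    team = rubrics["resolution"] + rubrics["mttr"] + rubrics["cascade"]
    return 0.6 * team + 0.4 * rubrics["oversight"]
```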

Files

  • adapter_model.safetensors + adapter_config.json — the LoRA itself
  • training_metrics.json — per-log reward, loss, and KL stream
  • learning_curve.png — reward curve (axis-labelled, 150 dpi)

How to use

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", device_map="auto"
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
model = PeftModel.from_pretrained(base, "helloAK96/chaosops-grpo-lora-p3a")

Alternatively, activate it directly inside the live ChaosOps Space by setting the Space secret CHAOSOPS_ADAPTER_PATH=helloAK96/chaosops-grpo-lora-p3a; the Space lazily snapshot-downloads the adapter on first request and routes the trained policy through it.
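A hypothetical sketch of the lazy-load pattern described above: read the secret once, then fetch the adapter only when a request first needs it. The env-var name matches the card; the function name and loader body are illustrative, not the Space's actual code.

```python
import os
from functools import lru_cache


@lru_cache(maxsize=1)
def adapter_repo():
    """Return the configured adapter repo id, or None if the secret is unset."""
    return os.environ.get("CHAOSOPS_ADAPTER_PATH")


# A request handler would then do something like (requires huggingface_hub):
#   from huggingface_hub import snapshot_download
#   local_dir = snapshot_download(adapter_repo())
```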

Results

Evaluation across 5 seeds × 9 failure types × 3 tiers (540 episodes per policy):

Tier     Random           Heuristic        Trained (this LoRA)          Oracle
EASY     -315.4 / 55%     +94.4 / 100%     +49.2 / 85%                  +134.0 / 100%
MEDIUM   -641.3 / 40%     +108.7 / 100%    -16.9 / 80%, 100% rogue+     +150.7 / 100%
HARD     -1356.6 / 10%    -425.3 / 50%     -433.4 / 40%, 93% rogue+     +170.0 / 100%

Mean cumulative episode reward / resolution rate. The trained policy:

  • closes the heuristic gap to about 8 points on HARD and 45 points on EASY;
  • is the only sub-Oracle policy that flags rogue fleet agents on MEDIUM (100% catch rate vs. the heuristic's 0%);
  • holds MTTR = 3.0 across all tiers, matching the Oracle on the few episodes the heuristic doesn't already one-shot;
  • gains +364.6 / +624.4 / +923.2 points over random on EASY / MEDIUM / HARD.
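The gain-over-random figures follow directly from the evaluation table: trained mean reward minus random mean reward, per tier.

```python
# Mean cumulative episode rewards from the evaluation table above.
random_mean = {"EASY": -315.4, "MEDIUM": -641.3, "HARD": -1356.6}
trained_mean = {"EASY": 49.2, "MEDIUM": -16.9, "HARD": -433.4}

# Gain over random per tier, rounded to one decimal place.
gains = {tier: round(trained_mean[tier] - random_mean[tier], 1)
         for tier in random_mean}
```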

The training-time reward curve in learning_curve.png shows positive mean reward in all three tiers, with the HARD-tier mean (+14.0) ending higher than the EASY-tier mean (+6.9): the curriculum let the model absorb easier-tier dynamics first, then stack competencies on the harder ones.

Trade-off: a 13-35% false-positive rogue-flag rate on tiers without a real rogue. The 2× rogue-rubric weight made the model eager to flag.
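A break-even view of that trade-off, assuming the listed payoffs (+100 for a true catch, -150 for a false positive) and a believed rogue probability p: flagging has positive expected value only when p > 0.6, so a policy pushed hard toward catches will over-flag on rogue-free tiers.

```python
def flag_expected_value(p):
    """Expected reward of raising a rogue flag given rogue probability p."""
    return p * 100.0 + (1.0 - p) * -150.0


# Expected value crosses zero at p = 150 / 250 = 0.6.
break_even = 150.0 / 250.0
```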


Citation

@misc{chaosops_ai_2026,
  title  = {ChaosOps AI: a multi-agent incident-response gym with rogue-agent detection},
  author = {ChaosOps AI Team},
  year   = {2026},
  url    = {https://huggingface.co/spaces/helloAK96/chaosops}
}