# 🧬 ImmunoOrg 2.0: The Autonomous, Self-Healing Enterprise (GRPO Defender)
This model is a GRPO-trained LoRA adapter for Qwen2.5-0.5B-Instruct. It was developed for the OpenEnv Hackathon (India 2026) to serve as an autonomous security defender that operates across both technical and organizational layers.
| Resource | Link |
|---|---|
| 🟢 Live Demo (Space) | `hirann/immunoorg-v3` |
| 📓 Training Notebook | `ImmunoOrg_Training_Colab.ipynb` |
| 🔬 Research Notes | `RESEARCH.md` |
## 📖 Model Overview
Unlike traditional security agents that only "patch servers," the ImmunoOrg Defender learns to close the Socio-Technical Gap. It understands that organizational silos and slow approval chains are as dangerous as open ports.
Core Capabilities:
- Adversarial Containment: Identifies and mitigates technical threats (SQLi, Lateral Movement, Data Exfiltration).
- Organizational Restructuring: Dynamically merges departments, establishes DevSecOps meshes, and reduces bureaucracy to speed up response times.
- Multi-Step Reasoning: Outputs structured JSON actions with internal "Reasoning" chains that align with MITRE ATT&CK techniques.
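For example, a single emitted action might look like the following (illustrative only; the field names and schema are assumptions, not the environment's published contract):

```python
# Illustrative action format -- field names are assumptions, not the
# environment's published schema. T1190 (Exploit Public-Facing Application)
# is the MITRE ATT&CK technique matching a SQL injection intrusion.
example_action = {
    "reasoning": "node-1 shows SQL injection (T1190); isolation is blocked, "
                 "so approval latency must be reduced first.",
    "action": "reduce_bureaucracy",
    "target": "engineering",
    "technique": "T1190",
}
```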
## 📊 Training Results (Evidence)
The model was trained using Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that optimizes the model's policy based on group-relative scores rather than a global value function.
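Concretely, GRPO samples a group of G completions per prompt and scores each one against its own group's statistics instead of a learned critic. A minimal sketch of that normalization (not this project's exact code):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) raw rewards, G completions per prompt.

    GRPO replaces a learned value baseline with group statistics:
    each completion is scored relative to its sibling samples.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[1.0, 0.2, 0.9, 0.1],
                        [0.5, 0.5, 0.8, 0.2]])
print(group_relative_advantages(rewards))
```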
### 📈 Proof of Intelligence: Random vs. Trained
The trained agent shows a 4.1x improvement in total reward compared to a random baseline in standard containment scenarios.
Figure 1: Performance lift across difficulty levels 1-4. The GRPO agent consistently outperforms random and heuristic baselines.
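A lift like this can be measured with a simple rollout loop. The sketch below assumes `ImmunoOrgEnvironment` exposes a Gym-style `reset`/`step` interface and an `actions` list; both are assumptions for illustration:

```python
import random

def rollout(env, policy, max_steps=50):
    """Return the total episode reward for policy(obs) -> action."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
        if done:
            break
    return total

def random_policy(env):
    # Baseline: sample uniformly from the action list (attribute name assumed).
    return lambda obs: random.choice(env.actions)

# lift = rollout(env, trained_policy) / rollout(env, random_policy(env))
```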
### 🛡️ Multi-Track Reward Alignment
We used a 5-track composable reward system to prevent reward hacking (a sketch of how these tracks compose follows Figure 2):
- Technical Uptime: Reward for maintaining service availability.
- Threat Neutralization: Significant reward for containing attacks.
- Bureaucracy Reduction: Reward for optimizing the org graph.
- Reasoning Fidelity: Penalty for hallucinated techniques.
- Instruction Following: Reward for adhering to mid-episode Board Directives.
Figure 2: Per-step contribution of the 5 reward tracks. No single track dominates, forcing the agent to balance security with business continuity.
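A minimal sketch of how five such tracks might compose into a single scalar; the weights and clamping below are illustrative assumptions, not the trained values:

```python
# Hypothetical weights -- the actual per-track weights are not published here.
REWARD_WEIGHTS = {
    "uptime": 0.25,
    "neutralization": 0.30,
    "bureaucracy": 0.15,
    "reasoning_fidelity": 0.15,
    "instruction_following": 0.15,
}

def composite_reward(tracks: dict) -> float:
    """Weighted sum over the five tracks. Clamping each track to [-1, 1]
    is one simple guard against any single signal dominating."""
    return sum(REWARD_WEIGHTS[k] * max(-1.0, min(1.0, v)) for k, v in tracks.items())
```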
## 🏗️ Environment Architecture: The Socio-Technical Gap
The agent operates in a dual-graph environment (a state sketch follows the list):
- Technical Layer: A network of nodes (Web, DB, CI/CD) with varying security postures.
- Organizational Layer: A graph of departments (Security, Engineering, Finance) with approval weights and communication latencies.
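A minimal sketch of the dual-graph state, with illustrative values (the real environment's field names are not published here):

```python
from dataclasses import dataclass, field

@dataclass
class DualGraphState:
    """Two coupled layers; all names and values are illustrative."""
    # Technical layer: node -> security posture (0 = compromised, 1 = hardened)
    tech_nodes: dict = field(default_factory=lambda: {"web": 0.4, "db": 0.8, "cicd": 0.6})
    # Organizational layer: (dept_a, dept_b) -> approval latency in hours
    org_edges: dict = field(default_factory=lambda: {("security", "engineering"): 72})
```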
### Self-Healing in Action
The agent learns that it can't isolate a node if the "Engineering Lead" blocks it. To win, it must first execute a Strategic Action to `establish_devsecops` or `reduce_bureaucracy` (see the sketch after Figure 3).
Figure 3: Org graph before and after agent intervention. Approval latency dropped from 72 hours to 4 hours by restructuring the communication mesh.
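The learned behavior can be summarized in a few lines; the function and field names below (reusing the `DualGraphState` sketch above) are assumptions for illustration, not the environment's API:

```python
def choose_action(state) -> dict:
    """If isolation is blocked by the org layer, restructure first.

    `state` follows the DualGraphState sketch above; field names are assumed.
    """
    latency = state.org_edges.get(("security", "engineering"), 0)
    if latency > 24:  # approvals too slow -- fix the org graph before containment
        return {"action": "reduce_bureaucracy", "target": "engineering"}
    return {"action": "isolate_node", "target": "node-1"}
```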
## 🔬 Training Procedure (GRPO)
- Algorithm: Group Relative Policy Optimization (GRPO) via `trl`.
- Base Model: `Qwen/Qwen2.5-0.5B-Instruct` (Unsloth 4-bit).
- Dataset: 1,700+ environment-native trajectories generated from `ImmunoOrgEnvironment`.
- Reward Functions:
  - `format_reward`: ensures valid JSON and MITRE technique IDs.
  - `reasoning_quality_reward`: rewards causal logic and grounding in observations.
  - `env_performance_reward`: direct signal from the simulation (reward per step).
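A minimal wiring sketch with `trl`'s `GRPOTrainer`; the hyperparameters, the simplified reward stub, and the `dataset` object are assumptions here, and the actual configuration lives in the training notebook:

```python
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Stub: the real check validates the JSON schema and MITRE technique IDs.
    return [1.0 if c.strip().startswith("{") else 0.0 for c in completions]

config = GRPOConfig(
    output_dir="immunoorg2-grpo",
    num_generations=8,           # group size G for group-relative scoring
    max_completion_length=128,
    max_steps=100,               # matches the 100-step curve in Figure 4
    learning_rate=5e-6,          # assumed; see the notebook for the real value
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[format_reward],   # plus reasoning_quality and env_performance
    args=config,
    train_dataset=dataset,          # 'prompt' column built from trajectories
)
trainer.train()
```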
Figure 4: Reward increase across 100 training steps (Group Relative scores).
## 🛠️ Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Qwen/Qwen2.5-0.5B-Instruct"
adapter = "hirann/immunoorg2-grpo-0.5b"

# Load the base model, then attach the GRPO-trained LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

# Prepare an environment observation
obs = "Current phase: Detection | Threat: SQL Injection detected on node-1 | Approval status: BLOCKED by engineering"
prompt = f"You are the ImmunoOrg Defender. Observation: {obs}\nAction:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
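The completion should contain a JSON action in the format described above; one assumption-laden way to extract it:

```python
import json
import re

text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Grab the first JSON object after the "Action:" marker. This assumes the
# adapter emits a single JSON action, as enforced by format_reward.
match = re.search(r"\{.*\}", text.split("Action:")[-1], re.DOTALL)
if match:
    action = json.loads(match.group(0))
    print(action)
```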
## ⚖️ License
Apache-2.0. Built for the OpenEnv Hackathon 2026.