# 🧬 ImmunoOrg 2.0: The Autonomous, Self-Healing Enterprise (GRPO Defender)
This model is a GRPO-trained LoRA adapter for Qwen2.5-0.5B-Instruct. It was developed for the OpenEnv Hackathon (India 2026) to serve as an autonomous security defender that operates across both technical and organizational layers.
| Resource | Link |
|---|---|
| 🟢 Live Demo (Space) | `hirann/immunoorg-v3` |
| 📓 Training Notebook | `ImmunoOrg_Training_Colab.ipynb` |
| 🔬 Research Notes | `RESEARCH.md` |
## 📖 Model Overview
Unlike traditional security agents that only "patch servers," the ImmunoOrg Defender learns to close the Socio-Technical Gap. It understands that organizational silos and slow approval chains are as dangerous as open ports.
Core Capabilities:
- Adversarial Containment: Identifies and mitigates technical threats (SQLi, Lateral Movement, Data Exfiltration).
- Organizational Restructuring: Dynamically merges departments, establishes DevSecOps meshes, and reduces bureaucracy to speed up response times.
- Multi-Step Reasoning: Outputs structured JSON actions with internal "Reasoning" chains that align with MITRE ATT&CK techniques.
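For example, a single emitted action might look like the following (illustrative only; the field names and schema are assumptions, not the environment's published contract):

```python
# Illustrative action format -- field names are assumptions, not the
# environment's published schema. T1190 (Exploit Public-Facing Application)
# is the MITRE ATT&CK technique matching a SQL injection intrusion.
example_action = {
    "reasoning": "node-1 shows SQL injection (T1190); isolation is blocked, "
                 "so approval latency must be reduced first.",
    "action": "reduce_bureaucracy",
    "target": "engineering",
    "technique": "T1190",
}
```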
## 📊 Training Results (Evidence)
The model was trained using Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that optimizes the model's policy based on group-relative scores rather than a global value function.
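Concretely, GRPO samples a group of G completions per prompt and scores each one against its own group's statistics instead of a learned critic. A minimal sketch of that normalization (not this project's exact code):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, group_size) raw rewards, G completions per prompt.

    GRPO replaces a learned value baseline with group statistics:
    each completion is scored relative to its sibling samples.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each
rewards = torch.tensor([[1.0, 0.2, 0.9, 0.1],
                        [0.5, 0.5, 0.8, 0.2]])
print(group_relative_advantages(rewards))
```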
### 📈 Proof of Intelligence: Random vs. Trained
The trained agent shows a 4.1x improvement in total reward compared to a random baseline in standard containment scenarios.
Figure 1: Performance lift across difficulty levels 1-4. The GRPO agent consistently outperforms random and heuristic baselines.
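A lift like this can be measured with a simple rollout loop. The sketch below assumes `ImmunoOrgEnvironment` exposes a Gym-style `reset`/`step` interface and an `actions` list; both are assumptions for illustration:

```python
import random

def rollout(env, policy, max_steps=50):
    """Return the total episode reward for policy(obs) -> action."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
        if done:
            break
    return total

def random_policy(env):
    # Baseline: sample uniformly from the action list (attribute name assumed).
    return lambda obs: random.choice(env.actions)

# lift = rollout(env, trained_policy) / rollout(env, random_policy(env))
```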
### 🛡️ Multi-Track Reward Alignment
We used a 5-track composable reward system to prevent reward hacking (a sketch of how these tracks compose follows Figure 2):
- Technical Uptime: Reward for maintaining service availability.
- Threat Neutralization: Significant reward for containing attacks.
- Bureaucracy Reduction: Reward for optimizing the org graph.
- Reasoning Fidelity: Penalty for hallucinated techniques.
- Instruction Following: Reward for adhering to mid-episode Board Directives.
Figure 2: Per-step contribution of the 5 reward tracks. No single track dominates, forcing the agent to balance security with business continuity.
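A minimal sketch of how five such tracks might compose into a single scalar; the weights and clamping below are illustrative assumptions, not the trained values:

```python
# Hypothetical weights -- the actual per-track weights are not published here.
REWARD_WEIGHTS = {
    "uptime": 0.25,
    "neutralization": 0.30,
    "bureaucracy": 0.15,
    "reasoning_fidelity": 0.15,
    "instruction_following": 0.15,
}

def composite_reward(tracks: dict) -> float:
    """Weighted sum over the five tracks. Clamping each track to [-1, 1]
    is one simple guard against any single signal dominating."""
    return sum(REWARD_WEIGHTS[k] * max(-1.0, min(1.0, v)) for k, v in tracks.items())
```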
## 🏗️ Environment Architecture: The Socio-Technical Gap
The agent operates in a dual-graph environment (a state sketch follows the list):
- Technical Layer: A network of nodes (Web, DB, CI/CD) with varying security postures.
- Organizational Layer: A graph of departments (Security, Engineering, Finance) with approval weights and communication latencies.
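A minimal sketch of the dual-graph state, with illustrative values (the real environment's field names are not published here):

```python
from dataclasses import dataclass, field

@dataclass
class DualGraphState:
    """Two coupled layers; all names and values are illustrative."""
    # Technical layer: node -> security posture (0 = compromised, 1 = hardened)
    tech_nodes: dict = field(default_factory=lambda: {"web": 0.4, "db": 0.8, "cicd": 0.6})
    # Organizational layer: (dept_a, dept_b) -> approval latency in hours
    org_edges: dict = field(default_factory=lambda: {("security", "engineering"): 72})
```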
### Self-Healing in Action
The agent learns that it can't isolate a node if the "Engineering Lead" blocks it. To win, it must first execute a Strategic Action to `establish_devsecops` or `reduce_bureaucracy` (see the sketch after Figure 3).
Figure 3: Org graph before and after agent intervention. Approval latency dropped from 72 hours to 4 hours by restructuring the communication mesh.
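The learned behavior can be summarized in a few lines; the function and field names below (reusing the `DualGraphState` sketch above) are assumptions for illustration, not the environment's API:

```python
def choose_action(state) -> dict:
    """If isolation is blocked by the org layer, restructure first.

    `state` follows the DualGraphState sketch above; field names are assumed.
    """
    latency = state.org_edges.get(("security", "engineering"), 0)
    if latency > 24:  # approvals too slow -- fix the org graph before containment
        return {"action": "reduce_bureaucracy", "target": "engineering"}
    return {"action": "isolate_node", "target": "node-1"}
```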
## 🔬 Training Procedure (GRPO)
- Algorithm: Group Relative Policy Optimization (GRPO) via `trl`.
- Base Model: `Qwen/Qwen2.5-0.5B-Instruct` (Unsloth 4-bit).
- Dataset: 1,700+ environment-native trajectories generated from `ImmunoOrgEnvironment`.
- Reward Functions:
  - `format_reward`: ensures valid JSON and MITRE technique IDs.
  - `reasoning_quality_reward`: rewards causal logic and grounding in observations.
  - `env_performance_reward`: direct signal from the simulation (reward per step).
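A minimal wiring sketch with `trl`'s `GRPOTrainer`; the hyperparameters, the simplified reward stub, and the `dataset` object are assumptions here, and the actual configuration lives in the training notebook:

```python
from trl import GRPOConfig, GRPOTrainer

def format_reward(completions, **kwargs):
    # Stub: the real check validates the JSON schema and MITRE technique IDs.
    return [1.0 if c.strip().startswith("{") else 0.0 for c in completions]

config = GRPOConfig(
    output_dir="immunoorg2-grpo",
    num_generations=8,           # group size G for group-relative scoring
    max_completion_length=128,
    max_steps=100,               # matches the 100-step curve in Figure 4
    learning_rate=5e-6,          # assumed; see the notebook for the real value
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=[format_reward],   # plus reasoning_quality and env_performance
    args=config,
    train_dataset=dataset,          # 'prompt' column built from trajectories
)
trainer.train()
```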
Figure 4: Reward increase across 100 training steps (Group Relative scores).
## 🛠️ Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Qwen/Qwen2.5-0.5B-Instruct"
adapter = "hirann/immunoorg2-grpo-0.5b"

# Load the base model, then attach the GRPO-trained LoRA adapter
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

# Prepare an environment observation
obs = "Current phase: Detection | Threat: SQL Injection detected on node-1 | Approval status: BLOCKED by engineering"
prompt = f"You are the ImmunoOrg Defender. Observation: {obs}\nAction:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
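The completion should contain a JSON action in the format described above; one assumption-laden way to extract it:

```python
import json
import re

text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Grab the first JSON object after the "Action:" marker. This assumes the
# adapter emits a single JSON action, as enforced by format_reward.
match = re.search(r"\{.*\}", text.split("Action:")[-1], re.DOTALL)
if match:
    action = json.loads(match.group(0))
    print(action)
```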
## ⚖️ License
Apache-2.0. Built for the OpenEnv Hackathon 2026.