🧬 ImmunoOrg 2.0: The Autonomous, Self-Healing Enterprise (GRPO Defender)

This model is a GRPO-trained LoRA adapter for Qwen2.5-0.5B-Instruct. It was developed for the OpenEnv Hackathon (India 2026) to serve as an autonomous security defender that operates across both technical and organizational layers.

Resources:

  • 🟢 Live Demo (Space): hirann/immunoorg-v3
  • 📔 Training Notebook: ImmunoOrg_Training_Colab.ipynb
  • 🔬 Research Notes: RESEARCH.md

🚀 Model Overview

Unlike traditional security agents that only "patch servers," the ImmunoOrg Defender learns to close the Socio-Technical Gap. It understands that organizational silos and slow approval chains are as dangerous as open ports.

Core Capabilities:

  1. Adversarial Containment: Identifies and mitigates technical threats (SQLi, Lateral Movement, Data Exfiltration).
  2. Organizational Restructuring: Dynamically merges departments, establishes DevSecOps meshes, and reduces bureaucracy to speed up response times.
  3. Multi-Step Reasoning: Outputs structured JSON actions with internal "Reasoning" chains that align with MITRE ATT&CK techniques.
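
For illustration, a single action emitted by the agent might look like the following (the field names here are an assumed schema based on the description above, not the verbatim format):

{
  "reasoning": "node-1 shows SQL injection probes consistent with T1190; isolating it preserves DB uptime",
  "action": "isolate_node",
  "target": "node-1",
  "mitre_technique": "T1190"
}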

📊 Training Results (Evidence)

The model was trained using Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that optimizes the model's policy based on group-relative scores rather than a global value function.
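
The core idea, as a minimal sketch (not the exact trl implementation): sample a group of completions per prompt and normalize each completion's reward against that group's own mean and standard deviation, which stands in for the advantage a learned critic would otherwise provide.

import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    The resulting z-scores play the role of advantages, so GRPO needs
    no separate value network.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 completions sampled for a single prompt
print(group_relative_advantages([0.2, 0.9, 0.4, 0.5]))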

📈 Proof of Intelligence: Random vs. Trained

The trained agent shows a 4.1x improvement in total reward compared to a random baseline in standard containment scenarios.

Figure 1 (Policy Comparison): Performance lift across difficulty levels 1–4. The GRPO agent consistently outperforms random and heuristic baselines.

πŸ›‘οΈ Multi-Track Reward Alignment

We used a 5-track composable reward system to prevent reward hacking (a sketch of how the tracks combine follows the list):

  • Technical Uptime: Reward for maintaining service availability.
  • Threat Neutralization: Significant reward for containing attacks.
  • Bureaucracy Reduction: Reward for optimizing the org graph.
  • Reasoning Fidelity: Penalty for hallucinated techniques.
  • Instruction Following: Reward for adhering to mid-episode Board Directives.
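
A minimal sketch of how such a composition could be wired up (the weights and clamping below are illustrative assumptions, not the values used in training):

# Illustrative track weights -- assumptions for this sketch only.
TRACK_WEIGHTS = {
    "uptime": 0.25,
    "neutralization": 0.30,
    "bureaucracy": 0.15,
    "reasoning_fidelity": 0.15,
    "instruction_following": 0.15,
}

def composite_reward(track_scores):
    """Weighted sum of per-track scores, each clamped to [-1, 1].

    Clamping bounds every track's contribution, so no single track can
    dominate the total -- the anti-reward-hacking property noted above.
    """
    return sum(TRACK_WEIGHTS[name] * max(-1.0, min(1.0, score))
               for name, score in track_scores.items())

print(composite_reward({"uptime": 0.8, "neutralization": 1.0,
                        "bureaucracy": 0.2, "reasoning_fidelity": 0.5,
                        "instruction_following": 1.0}))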

Figure 2 (5-Track Reward): Per-step contribution of the 5 reward tracks. No single track dominates, forcing the agent to balance security with business continuity.


πŸ—οΈ Environment Architecture: The Socio-Technical Gap

The agent operates in a dual-graph environment (a minimal state representation is sketched after the list):

  • Technical Layer: A network of nodes (Web, DB, CI/CD) with varying security postures.
  • Organizational Layer: A graph of departments (Security, Engineering, Finance) with approval weights and communication latencies.
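
As a rough sketch, a dual-graph observation could be represented like this (node and department names are hypothetical; the actual ImmunoOrgEnvironment observation format may differ):

# Hypothetical dual-graph state: one dict per layer.
state = {
    "technical": {                # nodes with security posture and links
        "web-1":  {"exposed": True,  "patched": False, "links": ["db-1"]},
        "db-1":   {"exposed": False, "patched": True,  "links": []},
        "cicd-1": {"exposed": False, "patched": False, "links": ["web-1"]},
    },
    "organizational": {           # departments with approval weights and latency
        "Security":    {"approval_weight": 0.9, "latency_hours": 2},
        "Engineering": {"approval_weight": 0.4, "latency_hours": 72},
        "Finance":     {"approval_weight": 0.2, "latency_hours": 48},
    },
}

print(state["organizational"]["Engineering"]["latency_hours"])  # 72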

Self-Healing in Action

The agent learns that it can't isolate a node if the "Engineering Lead" blocks it. To win, it must first execute a Strategic Action to establish_devsecops or reduce_bureaucracy.
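
A toy illustration of that gating logic (the threshold, state, and helper names are assumptions for this sketch, not the environment's actual rules):

# Hypothetical org state: Engineering starts with a low approval weight.
org = {"Engineering": {"approval_weight": 0.4, "latency_hours": 72}}

def try_isolate(node, org):
    """Technical actions only succeed once org approval is high enough."""
    if org["Engineering"]["approval_weight"] < 0.5:
        return f"BLOCKED: Engineering must sign off on isolating {node}"
    return f"OK: {node} isolated"

def reduce_bureaucracy(org):
    """Strategic action: raise the approval weight and cut latency."""
    org["Engineering"]["approval_weight"] = 0.8
    org["Engineering"]["latency_hours"] = 4

print(try_isolate("node-1", org))  # blocked at first
reduce_bureaucracy(org)            # strategic action unblocks the fix
print(try_isolate("node-1", org))  # now succeeds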

Figure 3 (Org Restructuring): Org graph before and after agent intervention. Approval latency dropped from 72 hours to 4 hours by restructuring the communication mesh.


🔬 Training Procedure (GRPO)

  • Algorithm: Group Relative Policy Optimization (GRPO) via trl.
  • Base Model: Qwen/Qwen2.5-0.5B-Instruct (Unsloth 4-bit).
  • Dataset: 1,700+ environment-native trajectories generated from ImmunoOrgEnvironment.
  • Reward Functions:
    1. format_reward: Ensures valid JSON and MITRE technique IDs (sketched after this list).
    2. reasoning_quality_reward: Rewards causal logic and grounding in observations.
    3. env_performance_reward: Direct signal from the simulation (reward per step).
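
As a minimal sketch of what the format check could look like (the score values are assumptions; MITRE technique IDs follow the Txxxx / Txxxx.yyy pattern):

import json
import re

MITRE_ID = re.compile(r"^T\d{4}(\.\d{3})?$")  # e.g. T1190 or T1021.002

def format_reward(completion):
    """Reward completions that parse as JSON and cite a well-formed MITRE ID."""
    try:
        action = json.loads(completion)
    except json.JSONDecodeError:
        return -1.0                  # malformed JSON
    if not MITRE_ID.match(action.get("mitre_technique", "")):
        return 0.0                   # parseable, but no valid technique ID
    return 1.0

print(format_reward('{"action": "isolate_node", "mitre_technique": "T1190"}'))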

Figure 4 (Training Curve): Reward increase across 100 training steps (group-relative scores).


🛠️ Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Qwen/Qwen2.5-0.5B-Instruct"
adapter = "hirann/immunoorg2-grpo-0.5b"

# Load the base model, then attach the GRPO-trained LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

# Prepare an environment observation as the prompt.
obs = "Current phase: Detection | Threat: SQL Injection detected on node-1 | Approval status: BLOCKED by engineering"
prompt = f"You are the ImmunoOrg Defender. Observation: {obs}\nAction:"

# Use the model's own device so this also runs on CPU-only machines.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
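
Since training rewards valid JSON actions, the completion can be parsed back into a structured action. A small follow-up to the snippet above (assuming the text after "Action:" is a single JSON object):

import json

completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Keep only the text the model generated after the "Action:" cue.
action_text = completion.split("Action:", 1)[-1].strip()
try:
    action = json.loads(action_text)
    print("Parsed action:", action.get("action"), "->", action.get("target"))
except json.JSONDecodeError:
    print("Completion was not valid JSON:", action_text)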

βš–οΈ License

Apache-2.0. Built for the OpenEnv Hackathon 2026.
