av-codes/prompt-injection-hrm-text

@@ -1,121 +1,63 @@
 ---
 license: mit
-language: en
 tags:
   - prompt-injection
-  - security
   - hrm-text
   - hierarchical-reasoning-model
-  - from-scratch
-  - byte-level
-datasets:
-  - Bordair/bordair-multimodal
-metrics:
-  - f1
-  - accuracy
-  - precision
-  - recall
-model-index:
-  - name: prompt-injection-hrm-text
-    results:
-      - task:
-          type: text-classification
-          name: Prompt Injection Detection
-        dataset:
-          name: Bordair Multimodal
-          type: Bordair/bordair-multimodal
-        metrics:
-          - type: f1
-            value: 0.974
-          - type: accuracy
-            value: 0.9762
-          - type: precision
-            value: 0.9858
-          - type: recall
-            value: 0.9633
 ---
 # HRM-Text Prompt Injection Detector
-A from-scratch, byte-level prompt injection detector using the HRM-Text (Hierarchical Reasoning Model for Text) architecture. Trained on the Bordair multimodal dataset (476K samples) with v1-v5 adversarial attacks.
-## Key Results
 | Metric | Value |
 |--------|-------|
-| F1 | **0.974** |
-| Accuracy | 0.976 |
-| Precision | 0.986 |
-| Recall | 0.963 |
-### Comparison on Bordair Multimodal (same eval split, 47,644 samples)
-| Model | F1 | Params | Training | Cost | Pretrained? |
-|-------|-----|--------|----------|------|------------|
-| DistilBERT v2 (61K data, zero-shot) | 0.278 | 67M | - | - | Yes (wrong data) |
-| **HRM-Text (this model)** | **0.974** | 46.2M | 33h | ~7 | No |
-| DistilBERT (bordair fine-tuned) | 0.999 | 67M | 0.9h | $0.72 | Yes |
 ## Architecture
-HRM-Text is a hierarchical recurrent transformer trained from raw bytes (no tokenizer, no pretrained weights).
-- **Input**: raw UTF-8 bytes (vocab=256), max 2,048 bytes
-- **L-module**: 3-layer transformer processing byte sequences (local patterns, word boundaries)
-- **H-module**: 3-layer transformer reasoning over L-module output (sentence meaning, intent)
-- **Recurrent cascade**: L runs 3x, then H runs 1x, repeated 2 cycles
-- **BP warmup**: gradient flow depth increases from 2 to 5 recurrent steps over first 20% of training
-- **Position encoding**: RoPE (Rotary Position Embeddings)
-- **FFN**: SwiGLU (same as LLaMA)
-- **Attention**: causal, with gradient checkpointing
-- **Classification**: last-token pooling from H-module -> linear head -> binary (safe/injection)
-- **Parameters**: 46.2M
-## Training
-- **Dataset**: [Bordair/bordair-multimodal](https://huggingface.co/datasets/Bordair/bordair-multimodal) -- 476,431 samples (224,855 injection + 251,576 safe)
-- **Attack types**: v1-v5 escalating sophistication (multi-turn, cross-modal, MCP injection, reasoning DoS)
-- **Split**: 90/10 stratified (seed=42) -- 428,787 train / 47,644 eval
-- **Hardware**: NVIDIA L4 (24GB VRAM) via HF Jobs
-- **Precision**: fp16
-- **Batch size**: 32
-- **Learning rate**: 5e-4 with cosine schedule
-- **Warmup**: 500 steps
-- **Context**: 2,048 bytes (covers ~90% of samples; p95=3,916 bytes)
-- **Step time**: ~3 sec/step on L4
-### Eval Progression
-| Step | Epoch | F1 | Notes |
-|------|-------|-----|-------|
-| 4,000 | 0.30 | 0.971 | First eval |
-| 8,000 | 0.60 | **0.974** | Best checkpoint (this model) |
-| 12,000 | 0.90 | 0.934 | Temporary regression |
-| 16,000 | 1.19 | 0.971 | Recovered; NaN grad_norms appearing |
-Training was stopped at epoch ~1.2 due to fp16 instability (NaN gradients) and F1 plateau. Best checkpoint at step 8,000 was retained.
-## Why This Model Exists
-This is a research model exploring whether hierarchical recurrent transformers can learn effective text classification from raw bytes without pretrained language understanding.
-**Key findings:**
-- A from-scratch byte-level model can reach 0.974 F1 on adversarial prompt injection -- remarkably close to pretrained models
-- Data matters more than architecture: DistilBERT trained on 61K simpler samples scores only 0.278 F1 on this dataset
-- Pretrained + right data still wins: DistilBERT fine-tuned on the same bordair data hits 0.999 F1 in 1/37th the time
-- The hierarchical recurrence successfully learns word-level and phrase-level patterns from bytes alone
-For production use, see [av-codes/prompt-injection-detector-v2-bordair](https://huggingface.co/av-codes/prompt-injection-detector-v2-bordair) (DistilBERT, 0.999 F1).
-## Limitations
-- 2,048 byte context truncates ~10% of samples (those longer than 2KB)
-- Byte-level tokenization makes sequences 5-8x longer than subword, increasing compute cost
-- Inference is slow (~12 samples/sec on L4) compared to subword models (~430 samples/sec)
-- Not tested on out-of-distribution attack types beyond bordair v1-v5
-## Citation
-Based on the HRM-Text architecture: [sapientinc/HRM-Text](https://github.com/sapientinc/HRM-Text)
-Dataset: [Bordair/bordair-multimodal](https://github.com/Josh-blythe/bordair-multimodal)

 ---
 license: mit
 tags:
   - prompt-injection
   - hrm-text
   - hierarchical-reasoning-model
+  - bordair-multimodal
+  - security
 ---
 # HRM-Text Prompt Injection Detector
+**Parameters:** 46,206,722
+**Architecture:** HRM-Text (classification port) | d=768, H=3, L=3, cycles=2×3
+**Context window:** 2,048 tokens (NTK-scaled RoPE)
+**Training data:** Bordair/bordair-multimodal (503K samples, balanced 1:1)
+Evaluation on stratified 10% holdout:
 | Metric | Value |
 |--------|-------|
+| Accuracy | 0.9893 |
+| Precision | 0.9934 |
+| Recall | 0.9838 |
+| F1 | 0.9886 |
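As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The helper name `f1_score` below is illustrative and not part of the repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation table in the model card.
reported_precision = 0.9934
reported_recall = 0.9838

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 4))  # 0.9886, matching the reported F1
```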
 ## Architecture
+HRM-Text (arXiv:2506.21734) with a classification head. The model uses a recurrent cascade of two transformer modules (H and L) that exchange information across cycles:
+- **L module** (3 layers, low-level): processes detailed token patterns
+- **H module** (3 layers, high-level): integrates across cycles
+- **Recurrence**: 3 L-steps per H-cycle, 2 H-cycles total = 6 recurrent passes
+- **Classification**: last-token pooling + LayerNorm + Linear(2)
+The byte-level tokenizer (vocab 256) operates on raw UTF-8 bytes, so any input string can be represented without out-of-vocabulary tokens. RoPE uses NTK-aware scaling (θ=10000.0, factor=1.0) for the 2,048-token context.
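The byte-level input and the recurrence counts described above can be illustrated with a small sketch. It only shows how text becomes byte IDs and in what order the modules would run; it is not the model code. The function names are hypothetical, and the "L steps first, then one H step" ordering is an assumption taken from the previous revision of this card ("L runs 3x, then H runs 1x, repeated 2 cycles").

```python
def encode_bytes(text: str, max_length: int = 2048) -> list[int]:
    """Map text to UTF-8 byte IDs (0-255), truncated to the context window."""
    return list(text.encode("utf-8", errors="replace")[:max_length])


def recurrence_schedule(h_cycles: int = 2, l_steps: int = 3) -> list[str]:
    """List the module invoked at each recurrent pass (assumed ordering)."""
    schedule = []
    for _ in range(h_cycles):
        schedule.extend(["L"] * l_steps)  # low-level module runs l_steps times
        schedule.append("H")              # then the high-level module runs once
    return schedule


ids = encode_bytes("héllo")
schedule = recurrence_schedule()
print(ids)       # [104, 195, 169, 108, 108, 111] ("é" occupies two bytes)
print(schedule)  # ['L', 'L', 'L', 'H', 'L', 'L', 'L', 'H']
```

Note that non-ASCII characters expand to several byte IDs, so the 2,048-token window holds fewer than 2,048 characters for such text.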
+## Usage
+```python
+import torch
+from train_hrm_text_pi import HrmTextClassifier
+
+model = HrmTextClassifier(
+    hidden_size=768,
+    num_heads=12,
+    head_dim=64,
+    n_layers_H=3,
+    n_layers_L=3,
+)
+state_dict = torch.load("pytorch_model.bin", map_location="cpu")
+# Strip the "module." prefix that DDP adds to parameter names, if present
+state_dict = {k.removeprefix("module."): v for k, v in state_dict.items()}
+model.load_state_dict(state_dict)
+model.eval()
+
+def detect(text, max_length=2048):
+    """Return 0 (safe) or 1 (injection) for a single string."""
+    # Truncate to the 2,048-token (byte) context window stated in this card
+    byte_ids = list(text.encode("utf-8", errors="replace")[:max_length])
+    input_ids = torch.tensor([byte_ids])
+    attention_mask = torch.ones_like(input_ids)
+    with torch.no_grad():  # inference only, no gradients needed
+        logits = model.inference(input_ids, attention_mask)
+    pred = logits.argmax(-1).item()  # 0=safe, 1=injection
+    return pred
+```
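The usage snippet returns only a hard label. Where a confidence score or a stricter decision threshold is wanted, the two logits can be turned into a probability with a softmax. The sketch below is a minimal pure-Python version, assuming the logits for one sample are available as a two-element list (for example via `logits[0].tolist()`), with index 1 meaning injection as in the snippet above. The function names and the threshold parameter are hypothetical and not part of the repository.

```python
import math


def injection_probability(logits: list[float]) -> float:
    """Softmax probability of class 1 (injection) from two raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def is_injection(logits: list[float], threshold: float = 0.5) -> bool:
    """Flag a sample when the injection probability reaches the threshold."""
    return injection_probability(logits) >= threshold


# Made-up logits for illustration only; these are not model outputs.
print(is_injection([2.0, -1.0]))        # False
print(is_injection([-1.0, 2.0]))        # True
print(is_injection([-1.0, 2.0], 0.99))  # False: probability is about 0.95
```

With the default threshold of 0.5 this matches the `argmax` decision except at an exact tie, which it flags as injection. Raising the threshold trades recall for precision.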