+---
+license: mit
+language: en
+tags:
+  - prompt-injection
+  - security
+  - hrm-text
+  - hierarchical-reasoning-model
+  - from-scratch
+  - byte-level
+datasets:
+  - Bordair/bordair-multimodal
+metrics:
+  - f1
+  - accuracy
+  - precision
+  - recall
+model-index:
+  - name: prompt-injection-hrm-text
+    results:
+      - task:
+          type: text-classification
+          name: Prompt Injection Detection
+        dataset:
+          name: Bordair Multimodal
+          type: Bordair/bordair-multimodal
+        metrics:
+          - type: f1
+            value: 0.974
+          - type: accuracy
+            value: 0.9762
+          - type: precision
+            value: 0.9858
+          - type: recall
+            value: 0.9633
+---
+# HRM-Text Prompt Injection Detector
+A from-scratch, byte-level prompt injection detector using the HRM-Text (Hierarchical Reasoning Model for Text) architecture. It is trained on the Bordair multimodal dataset (476K samples), which includes v1-v5 adversarial attacks.
+## Key Results
+| Metric | Value |
+|--------|-------|
+| F1 | **0.974** |
+| Accuracy | 0.976 |
+| Precision | 0.986 |
+| Recall | 0.963 |
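
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported value can be recomputed from the unrounded precision and recall in the card metadata. A minimal sketch (the helper name `f1_from_pr` is illustrative and not part of the model's code):

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Unrounded precision and recall from the model-index metadata.
f1 = f1_from_pr(0.9858, 0.9633)
print(f"{f1:.3f}")  # 0.974
```

The recomputed value agrees with the F1 in the table to three decimal places.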
+### Comparison on Bordair Multimodal (same eval split, 47,644 samples)
+| Model | F1 | Params | Training time | Cost | Pretrained? |
+|-------|-----|--------|---------------|------|------------|
+| DistilBERT v2 (61K data, zero-shot) | 0.278 | 67M | - | - | Yes (trained on a different 61K dataset) |
+| **HRM-Text (this model)** | **0.974** | 46.2M | 33h | ~$27 | No |
+| DistilBERT (bordair fine-tuned) | 0.999 | 67M | 0.9h | $0.72 | Yes |
+## Architecture
+HRM-Text is a hierarchical recurrent transformer trained from raw bytes (no tokenizer, no pretrained weights).
+- **Input**: raw UTF-8 bytes (vocab=256), max 2,048 bytes
+- **L-module**: 3-layer transformer processing byte sequences (local patterns, word boundaries)
+- **H-module**: 3-layer transformer reasoning over L-module output (sentence meaning, intent)
+- **Recurrent cascade**: L runs 3x, then H runs 1x, repeated 2 cycles
+- **BP warmup**: gradient flow depth increases from 2 to 5 recurrent steps over the first 20% of training
+- **Position encoding**: RoPE (Rotary Position Embeddings)
+- **FFN**: SwiGLU (same as LLaMA)
+- **Attention**: causal self-attention; gradient checkpointing is used during training to reduce memory
+- **Classification**: last-token pooling from H-module -> linear head -> binary (safe/injection)
+- **Parameters**: 46.2M
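
The recurrent cascade and BP warmup described above can be sketched as plain schedules. This is an illustration only, not the HRM-Text implementation: the function names are hypothetical, and the linear ramp for the gradient depth is an assumption, since the card states only the endpoints (2 to 5 steps) and the warmup span (first 20% of training).

```python
from typing import List


def cascade_schedule(l_steps: int = 3, h_steps: int = 1, cycles: int = 2) -> List[str]:
    """Order in which the L- and H-modules run in one forward pass."""
    schedule: List[str] = []
    for _ in range(cycles):
        schedule += ["L"] * l_steps + ["H"] * h_steps
    return schedule


def bp_depth(step: int, total_steps: int, start: int = 2, end: int = 5,
             warmup_frac: float = 0.2) -> int:
    """Number of trailing recurrent steps that receive gradients.

    Assumption: the depth ramps linearly from `start` to `end` over the
    first `warmup_frac` of training and stays at `end` afterwards.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    progress = min(step / warmup_steps, 1.0)
    return start + int(progress * (end - start))


print(cascade_schedule())   # ['L', 'L', 'L', 'H', 'L', 'L', 'L', 'H']
print(bp_depth(0, 1000))    # 2
print(bp_depth(200, 1000))  # 5
```

With the stated settings, one forward pass makes eight module calls (six L, two H), and gradients reach at most the last five of them once warmup ends.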
+## Training
+- **Dataset**: [Bordair/bordair-multimodal](https://huggingface.co/datasets/Bordair/bordair-multimodal) -- 476,431 samples (224,855 injection + 251,576 safe)
+- **Attack types**: v1-v5 escalating sophistication (multi-turn, cross-modal, MCP injection, reasoning DoS)
+- **Split**: 90/10 stratified (seed=42) -- 428,787 train / 47,644 eval
+- **Hardware**: NVIDIA L4 (24GB VRAM) via HF Jobs
+- **Precision**: fp16
+- **Batch size**: 32
+- **Learning rate**: 5e-4 with cosine schedule
+- **Warmup**: 500 steps
+- **Context**: 2,048 bytes (covers ~90% of samples; p95=3,916 bytes)
+- **Step time**: ~3 sec/step on L4
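
The dataset and split sizes listed above are internally consistent, which a few lines of arithmetic confirm. The rounding rule here (eval size rounded up, train set takes the remainder) is an assumption that happens to reproduce the stated counts; the card does not say which splitting utility was used.

```python
import math

n_injection, n_safe = 224_855, 251_576
n_total = n_injection + n_safe

# Assumption: the 10% eval share is rounded up and train gets the rest.
n_eval = math.ceil(n_total / 10)
n_train = n_total - n_eval

print(n_total, n_train, n_eval)  # 476431 428787 47644
print(f"injection share: {n_injection / n_total:.3f}")  # injection share: 0.472
```

A stratified split keeps the injection share (about 47%) roughly equal in both partitions.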
+### Eval Progression
+| Step | Epoch | F1 | Notes |
+|------|-------|-----|-------|
+| 4,000 | 0.30 | 0.971 | First eval |
+| 8,000 | 0.60 | **0.974** | Best checkpoint (this model) |
+| 12,000 | 0.90 | 0.934 | Temporary regression |
+| 16,000 | 1.19 | 0.971 | Recovered; NaN grad_norms appearing |
+Training was stopped at epoch ~1.2 because of fp16 instability (NaN gradients) and an F1 plateau. The best checkpoint, at step 8,000, was retained.
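
The epoch column in the table follows from the step count, batch size, and training-set size. A minimal sketch, assuming one batch of 32 samples per optimizer step (no gradient accumulation, which the card does not mention):

```python
TRAIN_SAMPLES = 428_787
BATCH_SIZE = 32


def epoch_at(step: int) -> float:
    """Fraction of the training set seen after `step` optimizer steps."""
    return step * BATCH_SIZE / TRAIN_SAMPLES


for step in (4_000, 8_000, 12_000, 16_000):
    print(step, f"{epoch_at(step):.2f}")
# 4000 0.30
# 8000 0.60
# 12000 0.90
# 16000 1.19
```

Under this assumption, the best checkpoint (step 8,000) had seen about 60% of the training set once.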
+## Why This Model Exists
+This is a research model exploring whether hierarchical recurrent transformers can learn effective text classification from raw bytes without pretrained language understanding.
+**Key findings:**
+- A from-scratch byte-level model can reach 0.974 F1 on adversarial prompt injection -- within 0.025 F1 of the fine-tuned DistilBERT baseline (0.999)
+- Data matters more than architecture: DistilBERT trained on 61K simpler samples scores only 0.278 F1 on this dataset
+- Pretrained + right data still wins: DistilBERT fine-tuned on the same Bordair data reaches 0.999 F1 in about 1/37th of the training time (0.9h vs. 33h)
+- The hierarchical recurrence successfully learns word-level and phrase-level patterns from bytes alone
+For production use, see [av-codes/prompt-injection-detector-v2-bordair](https://huggingface.co/av-codes/prompt-injection-detector-v2-bordair) (DistilBERT, 0.999 F1).
+## Limitations
+- 2,048 byte context truncates ~10% of samples (those longer than 2KB)
+- Byte-level input makes sequences 5-8x longer than subword tokenization, increasing compute cost
+- Inference is slow (~12 samples/sec on L4) compared to subword models (~430 samples/sec)
+- Not tested on out-of-distribution attack types beyond bordair v1-v5
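
To make the context limit concrete, the sketch below shows how text maps to byte IDs and where truncation applies. The function name is hypothetical, and keeping the first 2,048 bytes (rather than the last) is an assumption; the card states only the limit itself.

```python
from typing import List

MAX_BYTES = 2048  # context length stated in this card


def encode_bytes(text: str, max_bytes: int = MAX_BYTES) -> List[int]:
    """Map text to UTF-8 byte IDs (0-255), keeping at most `max_bytes`.

    Assumption: inputs longer than the context are cut at the end.
    """
    return list(text.encode("utf-8")[:max_bytes])


print(encode_bytes("hi"))               # [104, 105]
print(encode_bytes("é"))                # [195, 169]
print(len(encode_bytes("a" * 5000)))    # 2048
```

Because non-ASCII characters take several bytes each, a hard byte cut can split a character at the boundary, and everything past the limit is invisible to the classifier.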
+## Citation
+Based on the HRM-Text architecture: [sapientinc/HRM-Text](https://github.com/sapientinc/HRM-Text)
+Dataset: [Bordair/bordair-multimodal](https://github.com/Josh-blythe/bordair-multimodal)