av-codes committed on
Commit
44cbed0
·
verified ·
1 Parent(s): af025e6

Add README

Files changed (1)
  1. README.md +46 -104
README.md CHANGED
@@ -1,121 +1,63 @@
1
  ---
2
  license: mit
3
- language: en
4
  tags:
5
  - prompt-injection
6
- - security
7
  - hrm-text
8
  - hierarchical-reasoning-model
9
- - from-scratch
10
- - byte-level
11
- datasets:
12
- - Bordair/bordair-multimodal
13
- metrics:
14
- - f1
15
- - accuracy
16
- - precision
17
- - recall
18
- model-index:
19
- - name: prompt-injection-hrm-text
20
- results:
21
- - task:
22
- type: text-classification
23
- name: Prompt Injection Detection
24
- dataset:
25
- name: Bordair Multimodal
26
- type: Bordair/bordair-multimodal
27
- metrics:
28
- - type: f1
29
- value: 0.974
30
- - type: accuracy
31
- value: 0.9762
32
- - type: precision
33
- value: 0.9858
34
- - type: recall
35
- value: 0.9633
36
  ---
37
 
38
  # HRM-Text Prompt Injection Detector
39
 
40
- A from-scratch, byte-level prompt injection detector using the HRM-Text (Hierarchical Reasoning Model for Text) architecture. Trained on the Bordair multimodal dataset (476K samples) with v1-v5 adversarial attacks.
 
41
 
42
- ## Key Results
43
 
44
  | Metric | Value |
45
  |--------|-------|
46
- | F1 | **0.974** |
47
- | Accuracy | 0.976 |
48
- | Precision | 0.986 |
49
- | Recall | 0.963 |
50
-
51
- ### Comparison on Bordair Multimodal (same eval split, 47,644 samples)
52
-
53
- | Model | F1 | Params | Training | Cost | Pretrained? |
54
- |-------|-----|--------|----------|------|------------|
55
- | DistilBERT v2 (61K data, zero-shot) | 0.278 | 67M | - | - | Yes (wrong data) |
56
- | **HRM-Text (this model)** | **0.974** | 46.2M | 33h | ~7 | No |
57
- | DistilBERT (bordair fine-tuned) | 0.999 | 67M | 0.9h | $0.72 | Yes |
58
 
59
  ## Architecture
60
 
61
- HRM-Text is a hierarchical recurrent transformer trained from raw bytes (no tokenizer, no pretrained weights).
62
-
63
- - **Input**: raw UTF-8 bytes (vocab=256), max 2,048 bytes
64
- - **L-module**: 3-layer transformer processing byte sequences (local patterns, word boundaries)
65
- - **H-module**: 3-layer transformer reasoning over L-module output (sentence meaning, intent)
66
- - **Recurrent cascade**: L runs 3x, then H runs 1x, repeated 2 cycles
67
- - **BP warmup**: gradient flow depth increases from 2 to 5 recurrent steps over first 20% of training
68
- - **Position encoding**: RoPE (Rotary Position Embeddings)
69
- - **FFN**: SwiGLU (same as LLaMA)
70
- - **Attention**: causal, with gradient checkpointing
71
- - **Classification**: last-token pooling from H-module -> linear head -> binary (safe/injection)
72
- - **Parameters**: 46.2M
73
-
74
- ## Training
75
-
76
- - **Dataset**: [Bordair/bordair-multimodal](https://huggingface.co/datasets/Bordair/bordair-multimodal) -- 476,431 samples (224,855 injection + 251,576 safe)
77
- - **Attack types**: v1-v5 escalating sophistication (multi-turn, cross-modal, MCP injection, reasoning DoS)
78
- - **Split**: 90/10 stratified (seed=42) -- 428,787 train / 47,644 eval
79
- - **Hardware**: NVIDIA L4 (24GB VRAM) via HF Jobs
80
- - **Precision**: fp16
81
- - **Batch size**: 32
82
- - **Learning rate**: 5e-4 with cosine schedule
83
- - **Warmup**: 500 steps
84
- - **Context**: 2,048 bytes (covers ~90% of samples; p95=3,916 bytes)
85
- - **Step time**: ~3 sec/step on L4
86
-
87
- ### Eval Progression
88
-
89
- | Step | Epoch | F1 | Notes |
90
- |------|-------|-----|-------|
91
- | 4,000 | 0.30 | 0.971 | First eval |
92
- | 8,000 | 0.60 | **0.974** | Best checkpoint (this model) |
93
- | 12,000 | 0.90 | 0.934 | Temporary regression |
94
- | 16,000 | 1.19 | 0.971 | Recovered; NaN grad_norms appearing |
95
-
96
- Training was stopped at epoch ~1.2 due to fp16 instability (NaN gradients) and F1 plateau. Best checkpoint at step 8,000 was retained.
97
-
98
- ## Why This Model Exists
99
-
100
- This is a research model exploring whether hierarchical recurrent transformers can learn effective text classification from raw bytes without pretrained language understanding.
101
-
102
- **Key findings:**
103
- - A from-scratch byte-level model can reach 0.974 F1 on adversarial prompt injection -- remarkably close to pretrained models
104
- - Data matters more than architecture: DistilBERT trained on 61K simpler samples scores only 0.278 F1 on this dataset
105
- - Pretrained + right data still wins: DistilBERT fine-tuned on the same bordair data hits 0.999 F1 in 1/37th the time
106
- - The hierarchical recurrence successfully learns word-level and phrase-level patterns from bytes alone
107
-
108
- For production use, see [av-codes/prompt-injection-detector-v2-bordair](https://huggingface.co/av-codes/prompt-injection-detector-v2-bordair) (DistilBERT, 0.999 F1).
109
-
110
- ## Limitations
111
-
112
- - 2,048 byte context truncates ~10% of samples (those longer than 2KB)
113
- - Byte-level tokenization makes sequences 5-8x longer than subword, increasing compute cost
114
- - Inference is slow (~12 samples/sec on L4) compared to subword models (~430 samples/sec)
115
- - Not tested on out-of-distribution attack types beyond bordair v1-v5
116
-
117
- ## Citation
118
-
119
- Based on the HRM-Text architecture: [sapientinc/HRM-Text](https://github.com/sapientinc/HRM-Text)
120
-
121
- Dataset: [Bordair/bordair-multimodal](https://github.com/Josh-blythe/bordair-multimodal)
 
1
  ---
2
  license: mit
 
3
  tags:
4
  - prompt-injection
 
5
  - hrm-text
6
  - hierarchical-reasoning-model
7
+ - bordair-multimodal
8
+ - security
 
9
  ---
10
 
11
  # HRM-Text Prompt Injection Detector
12
 
13
+ **Parameters:** 46,206,722
14
+ **Architecture:** HRM-Text (classification port) | d=768, H=3, L=3, cycles=2×3
15
+ **Context window:** 2,048 tokens (NTK-scaled RoPE)
16
+ **Training data:** Bordair/bordair-multimodal (503K samples, balanced 1:1)
17
 
18
+ Evaluation on stratified 10% holdout:
19
 
20
  | Metric | Value |
21
  |--------|-------|
22
+ | Accuracy | 0.9893 |
23
+ | Precision | 0.9934 |
24
+ | Recall | 0.9838 |
25
+ | F1 | 0.9886 |
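
As a quick sanity check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall; the helper name `f1_from_pr` is illustrative and not part of the repository.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the evaluation table
print(round(f1_from_pr(0.9934, 0.9838), 4))  # 0.9886
```

The recomputed value agrees with the reported F1 to four decimal places.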
 
26
 
27
  ## Architecture
28
 
29
+ HRM-Text (arXiv:2506.21734) with a classification head. The model uses a recurrent cascade of two transformer modules (H and L) that exchange information across cycles:
30
+
31
+ - **L module** (3 layers, low-level): processes detailed token patterns
32
+ - **H module** (3 layers, high-level): integrates across cycles
33
+ - **Recurrence**: 3 L-steps per H-cycle, 2 H-cycles total = 6 recurrent passes
34
+ - **Classification**: last-token pooling + LayerNorm + Linear(2)
35
+
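
The recurrence bullet can be read as a simple nested loop. The sketch below only enumerates the order of module calls; it assumes each H-cycle runs its L-steps first and then one H update, as the earlier README revision described ("L runs 3x, then H runs 1x, repeated 2 cycles"). The function name and tuple layout are illustrative, not taken from the training code.

```python
def recurrence_schedule(h_cycles: int = 2, l_steps: int = 3):
    """List the (module, cycle, step) calls in one forward pass.

    Assumption: within each H-cycle, all L-steps run before the H update.
    """
    schedule = []
    for cycle in range(h_cycles):
        for step in range(l_steps):
            schedule.append(("L", cycle, step))
        schedule.append(("H", cycle, 0))
    return schedule


calls = recurrence_schedule()
l_passes = sum(1 for module, _, _ in calls if module == "L")
h_updates = sum(1 for module, _, _ in calls if module == "H")
print(l_passes, h_updates)  # 6 2
```

With the README's settings this gives the six L passes it mentions, interleaved with two H updates.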
36
+ The byte-level tokenizer (vocab 256) handles any text encoding. RoPE uses NTK-aware scaling (θ=10000.0, factor=1.0) for 2,048-token context.
37
+
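
To illustrate the byte-level input, the sketch below encodes text as UTF-8 and truncates it to the stated 2,048-position context. The helper name `encode_bytes` is hypothetical, and truncating at the context length is an assumption about preprocessing rather than something the README states.

```python
def encode_bytes(text: str, max_length: int = 2048) -> list[int]:
    """Map text to a list of byte ids in the range 0..255.

    Truncation happens on raw bytes, so a multi-byte character at the
    boundary may be cut; a byte-level model still sees valid ids.
    """
    data = text.encode("utf-8", errors="replace")[:max_length]
    return list(data)


ids = encode_bytes("héllo")
print(ids)  # [104, 195, 169, 108, 108, 111]
```

Non-ASCII characters expand to several ids ("é" becomes two bytes), which is why byte sequences run longer than the character count.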
38
+ ## Usage
39
+ ```python
40
+ import torch
41
+ from train_hrm_text_pi import HrmTextClassifier
42
+
43
+ model = HrmTextClassifier(
44
+     hidden_size=768,
45
+     num_heads=12,
46
+     head_dim=64,
47
+     n_layers_H=3,
48
+     n_layers_L=3,
49
+ )
50
+ state_dict = torch.load("pytorch_model.bin", map_location="cpu")
51
+ # Remove DDP wrapper keys if present
52
+ state_dict = {k.replace('module.', ''): v for k, v in state_dict.items()}
53
+ model.load_state_dict(state_dict)
54
+ model.eval()
55
+
56
+ def detect(text, max_length=131072):
57
+     byte_ids = list(text.encode("utf-8", errors="replace")[:max_length])
58
+     input_ids = torch.tensor([byte_ids])
59
+     attention_mask = torch.ones_like(input_ids)
60
+     logits = model.inference(input_ids, attention_mask)
61
+     pred = logits.argmax(-1).item()  # 0=safe, 1=injection
62
+     return pred
63
+ ```
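
The usage snippet returns only the argmax class. If a confidence score is wanted, the two logits can be passed through a softmax. The sketch below does this in plain Python; it assumes the head emits two logits ordered as (safe, injection), matching the `0=safe, 1=injection` comment, and that a softmax reading is appropriate for how the model was trained, which the README does not state. The function name is illustrative.

```python
import math


def injection_probability(logits: list[float]) -> float:
    """Softmax probability of class 1 (injection) from two raw logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


# Equal logits mean the model is undecided between the two classes
print(injection_probability([0.0, 0.0]))  # 0.5
```

A threshold other than 0.5 on this probability would trade precision against recall.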