av-codes committed on
Commit 6836f3e · verified · 1 Parent(s): 8675713

Add proper model card

  1. README.md +121 -0
README.md ADDED
@@ -0,0 +1,121 @@
+ ---
+ license: mit
+ language: en
+ tags:
+ - prompt-injection
+ - security
+ - hrm-text
+ - hierarchical-reasoning-model
+ - from-scratch
+ - byte-level
+ datasets:
+ - Bordair/bordair-multimodal
+ metrics:
+ - f1
+ - accuracy
+ - precision
+ - recall
+ model-index:
+ - name: prompt-injection-hrm-text
+   results:
+   - task:
+       type: text-classification
+       name: Prompt Injection Detection
+     dataset:
+       name: Bordair Multimodal
+       type: Bordair/bordair-multimodal
+     metrics:
+     - type: f1
+       value: 0.974
+     - type: accuracy
+       value: 0.9762
+     - type: precision
+       value: 0.9858
+     - type: recall
+       value: 0.9633
+ ---
+
+ # HRM-Text Prompt Injection Detector
+
+ A from-scratch, byte-level prompt injection detector built on the HRM-Text (Hierarchical Reasoning Model for Text) architecture. It was trained on the Bordair multimodal dataset (476K samples), which covers v1-v5 adversarial attacks.
+
+ ## Key Results
+
+ | Metric | Value |
+ |--------|-------|
+ | F1 | **0.974** |
+ | Accuracy | 0.976 |
+ | Precision | 0.986 |
+ | Recall | 0.963 |
+
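As a quick consistency check on the table above, F1 can be recomputed from the reported precision and recall. This is only a sketch of the standard harmonic-mean formula; the helper name `f1_score` is illustrative and not part of the model's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the model card metadata.
precision, recall = 0.9858, 0.9633
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.3f}")  # F1 = 0.974
```

The recomputed value agrees with the reported F1 to three decimal places.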
+ ### Comparison on Bordair Multimodal (same eval split, 47,644 samples)
+
+ | Model | F1 | Params | Training time | Cost | Pretrained? |
+ |-------|-----|--------|---------------|------|-------------|
+ | DistilBERT v2 (61K data, zero-shot) | 0.278 | 67M | - | - | Yes (wrong data) |
+ | **HRM-Text (this model)** | **0.974** | 46.2M | 33h | ~$27 | No |
+ | DistilBERT (bordair fine-tuned) | 0.999 | 67M | 0.9h | $0.72 | Yes |
+
+ ## Architecture
+
+ HRM-Text is a hierarchical recurrent transformer trained from raw bytes (no tokenizer, no pretrained weights).
+
+ - **Input**: raw UTF-8 bytes (vocab=256), max 2,048 bytes
+ - **L-module**: 3-layer transformer processing byte sequences (local patterns, word boundaries)
+ - **H-module**: 3-layer transformer reasoning over L-module output (sentence meaning, intent)
+ - **Recurrent cascade**: L runs 3x, then H runs 1x, repeated 2 cycles
+ - **BP warmup**: gradient flow depth increases from 2 to 5 recurrent steps over first 20% of training
+ - **Position encoding**: RoPE (Rotary Position Embeddings)
+ - **FFN**: SwiGLU (same as LLaMA)
+ - **Attention**: causal, with gradient checkpointing
+ - **Classification**: last-token pooling from H-module -> linear head -> binary (safe/injection)
+ - **Parameters**: 46.2M
+
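The recurrent cascade and BP warmup described in the list above can be illustrated with a small, framework-free sketch. It only models the order in which the two modules run and how the backpropagation depth might grow. The function names, the linear ramp, and the way depth maps onto steps are assumptions made for illustration, not details taken from the HRM-Text code.

```python
def cascade_schedule(l_steps: int = 3, h_steps: int = 1, cycles: int = 2) -> list[str]:
    """Order of module calls: L runs l_steps times, then H runs h_steps times, per cycle."""
    schedule: list[str] = []
    for _ in range(cycles):
        schedule += ["L"] * l_steps + ["H"] * h_steps
    return schedule


def bp_depth(step: int, total_steps: int, start: int = 2, end: int = 5,
             warmup_frac: float = 0.2) -> int:
    """Number of trailing recurrent steps that receive gradients.

    Assumption: the depth ramps linearly from `start` to `end` over the
    first `warmup_frac` of training and then stays at `end`.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    progress = min(step / warmup_steps, 1.0)
    return start + int(progress * (end - start))


print(cascade_schedule())            # ['L', 'L', 'L', 'H', 'L', 'L', 'L', 'H']
print(bp_depth(0, total_steps=1000))    # 2
print(bp_depth(500, total_steps=1000))  # 5
```

With the defaults, one forward pass makes eight module calls, and after warmup gradients would flow through the last five of them under the assumptions stated above.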
+ ## Training
+
+ - **Dataset**: [Bordair/bordair-multimodal](https://huggingface.co/datasets/Bordair/bordair-multimodal) -- 476,431 samples (224,855 injection + 251,576 safe)
+ - **Attack types**: v1-v5 with escalating sophistication (multi-turn, cross-modal, MCP injection, reasoning DoS)
+ - **Split**: 90/10 stratified (seed=42) -- 428,787 train / 47,644 eval
+ - **Hardware**: NVIDIA L4 (24GB VRAM) via HF Jobs
+ - **Precision**: fp16
+ - **Batch size**: 32
+ - **Learning rate**: 5e-4 with cosine schedule
+ - **Warmup**: 500 steps
+ - **Context**: 2,048 bytes (covers ~90% of samples; p95=3,916 bytes)
+ - **Step time**: ~3 sec/step on L4
+
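The learning-rate settings listed above (peak 5e-4, cosine schedule, 500 warmup steps) could correspond to a schedule like the following sketch. The linear warmup shape, the decay to zero, and the `total_steps` value are assumptions, since the card does not state them. `lr_at` is a hypothetical helper, not a function from the training script.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 5e-4,
          warmup_steps: int = 500, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# total_steps is a placeholder; the planned run length is not given in the card.
TOTAL_STEPS = 20_000
for step in (0, 250, 500, 10_250, 20_000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

Under these assumptions the rate reaches its 5e-4 peak at step 500 and falls to half of that midway through the decay phase.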
+ ### Eval Progression
+
+ | Step | Epoch | F1 | Notes |
+ |------|-------|-----|-------|
+ | 4,000 | 0.30 | 0.971 | First eval |
+ | 8,000 | 0.60 | **0.974** | Best checkpoint (this model) |
+ | 12,000 | 0.90 | 0.934 | Temporary regression |
+ | 16,000 | 1.19 | 0.971 | Recovered; NaN grad_norms appearing |
+
+ Training was stopped at epoch ~1.2 because of fp16 instability (NaN gradients) and an F1 plateau. The best checkpoint, at step 8,000, was retained.
+
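The epoch column in the table above follows from the step count, the batch size of 32, and the 428,787 training samples. The sketch below reproduces that arithmetic, assuming no gradient accumulation (the card does not mention any). `epoch_at` is an illustrative helper.

```python
TRAIN_SAMPLES = 428_787  # training split size from the card
BATCH_SIZE = 32          # assumed to be the effective batch size


def epoch_at(step: int) -> float:
    """Fraction of the training set seen after `step` optimizer steps."""
    return step * BATCH_SIZE / TRAIN_SAMPLES


for step in (4_000, 8_000, 12_000, 16_000):
    print(f"step {step:>6,}: epoch {epoch_at(step):.2f}")
# step  4,000: epoch 0.30
# step  8,000: epoch 0.60
# step 12,000: epoch 0.90
# step 16,000: epoch 1.19
```

The computed values match the reported epochs, which suggests the effective batch size was indeed 32.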
+ ## Why This Model Exists
+
+ This is a research model exploring whether hierarchical recurrent transformers can learn effective text classification from raw bytes without pretrained language understanding.
+
+ **Key findings:**
+ - A from-scratch byte-level model can reach 0.974 F1 on adversarial prompt injection -- within 0.025 of the fine-tuned DistilBERT baseline (0.999)
+ - Data matters more than architecture: DistilBERT trained on 61K simpler samples scores only 0.278 F1 on this dataset
+ - Pretrained + right data still wins: DistilBERT fine-tuned on the same bordair data hits 0.999 F1 in about 1/37th of the training time (0.9h vs 33h)
+ - The hierarchical recurrence successfully learns word-level and phrase-level patterns from bytes alone
+
+ For production use, see [av-codes/prompt-injection-detector-v2-bordair](https://huggingface.co/av-codes/prompt-injection-detector-v2-bordair) (DistilBERT, 0.999 F1).
+
+ ## Limitations
+
+ - 2,048 byte context truncates ~10% of samples (those longer than 2KB)
+ - Byte-level input makes sequences 5-8x longer than subword tokenization, increasing compute cost
+ - Inference is slow (~12 samples/sec on L4) compared to subword models (~430 samples/sec), roughly 36x slower
+ - Not tested on out-of-distribution attack types beyond bordair v1-v5
+
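To make the context limit above concrete, here is a minimal sketch of byte-level input preparation. Keeping the first 2,048 bytes (rather than the last) is an assumption, and `encode_bytes` is a hypothetical helper rather than the model's actual preprocessing function.

```python
MAX_BYTES = 2_048  # context length from the card


def encode_bytes(text: str, max_bytes: int = MAX_BYTES) -> list[int]:
    """Map text to UTF-8 byte IDs (0-255), keeping at most max_bytes of them.

    Assumption: over-long inputs are truncated from the end.
    """
    return list(text.encode("utf-8"))[:max_bytes]


print(encode_bytes("hi"))              # [104, 105]
print(encode_bytes("é"))               # [195, 169] (one character, two bytes)
print(len(encode_bytes("a" * 5_000)))  # 2048
```

Truncating at a byte boundary can split a multi-byte character, and any injection payload placed beyond the cutoff would never reach the model under this scheme.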
+ ## Citation
+
+ Based on the HRM-Text architecture: [sapientinc/HRM-Text](https://github.com/sapientinc/HRM-Text)
+
+ Dataset: [Bordair/bordair-multimodal](https://github.com/Josh-blythe/bordair-multimodal)