Commit 7c820fa (verified) by jsantillana · Parent: 508a454

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +262 -0
README.md ADDED
@@ -0,0 +1,262 @@

# VectraYX: Reproducibility Release

**Paper:** *VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use*

This repository contains the code, datasets, and pre-computed results needed to reproduce the key experiments from the paper.

---

## Repository Structure

```
release/
├── Makefile  ← make repro / make bench-nano / make lora-nano
├── requirements.txt  ← exact package versions
├── configs/
│   ├── nano.json  ← Nano 42M architecture (GQA 8q/2kv, d_model=512)
│   └── base.json  ← Base 260M architecture (GQA 16q/4kv, d_model=1024)
├── training/
│   ├── transformer.py  ← VectraYXNano model (GQA + QK-Norm + Z-loss + RoPE)
│   ├── pretrain.py  ← 3-phase curriculum pre-training driver
│   ├── finetune_sft.py  ← SFT with assistant-only loss masking + mini-curriculum
│   ├── finetune_lora_tools.py  ← LoRA adapter injection + merge (key experiment)
│   ├── finetune_tools.py  ← Full fine-tune (baseline comparison)
│   ├── sft_dataset.py  ← JSONL → tokenized dataset with loss masking
│   ├── utils.py  ← AdamW, cosine LR, checkpoint save/load
│   ├── aws_lora_nano_tools_s3.py  ← SageMaker launcher: Nano LoRA (S3-only)
│   └── aws_lora_base_tools_s3.py  ← SageMaker launcher: Base LoRA (S3-only)
├── eval/
│   ├── benchmark.py  ← VectraYX-Bench B1–B5 harness
│   ├── run_inference_lora.py  ← Inference with LoRA adapter loaded
│   ├── run_inference_base.py  ← Inference with base checkpoint
│   └── red_team_eval.py  ← Adversarial probe script
├── eval_data/
│   ├── b1_cveqa.jsonl  ← 500 CVE Q&A prompts + expected keywords
│   ├── b2_classification.jsonl  ← 200 threat classification examples
│   ├── b3_commands.jsonl  ← 35 command-line completion prompts
│   ├── b4_tooluse.jsonl  ← 25 tool-selection prompts (v2: 50 prompts)
│   └── b5_conversational.jsonl  ← 10 conversational gate prompts
├── corpus/
│   ├── tool_sft_mini_v1.jsonl  ← 2,801 tool-use examples (ratio 1:21) ← KEY
│   ├── tool_sft_v3_bash.jsonl  ← 296 bash-focused examples
│   ├── tool_sft_v2_simple.jsonl  ← 115 simple bash examples
│   ├── b4_tooluse_v2.jsonl  ← B4 benchmark v2 (50 questions, 60% bash)
│   ├── build_mini_tool_corpus.py  ← Regenerate tool_sft_mini_v1 from scratch
│   ├── build_tool_sft_corpus.py  ← Full tool-use corpus generator
│   └── build_v3_and_bench.py  ← v3 corpus + benchmark builder
├── results/
│   ├── bench_nano_baseline_multiseed.json  ← Nano baseline N=4 seeds (paper Table 2)
│   ├── bench_nano_lora_multiseed.json  ← Nano LoRA N=4 seeds (paper Table 3)
│   └── bench_base_lora_s42.json  ← Base LoRA seed=42 (paper Table 3)
└── paper/
    └── main.pdf  ← Paper PDF
```

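The GQA annotations next to `configs/nano.json` and `configs/base.json` can be read as head counts. The sketch below is only an illustration: it assumes "8q/2kv" means 8 query heads sharing 2 key/value heads, and that the head dimension is `d_model` divided by the number of query heads. Neither assumption is confirmed by the tree itself, and the `GQAShape` class is hypothetical; the authoritative values are in the config files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GQAShape:
    """Hypothetical summary of a grouped-query attention layout."""

    d_model: int
    n_q_heads: int
    n_kv_heads: int

    @property
    def head_dim(self) -> int:
        # Assumption: d_model is split evenly across the query heads.
        assert self.d_model % self.n_q_heads == 0
        return self.d_model // self.n_q_heads

    @property
    def queries_per_kv(self) -> int:
        # Each key/value head is shared by this many query heads.
        assert self.n_q_heads % self.n_kv_heads == 0
        return self.n_q_heads // self.n_kv_heads


# Values copied from the comments in the repository tree.
NANO = GQAShape(d_model=512, n_q_heads=8, n_kv_heads=2)
BASE = GQAShape(d_model=1024, n_q_heads=16, n_kv_heads=4)

for name, shape in (("nano", NANO), ("base", BASE)):
    print(name, "head_dim =", shape.head_dim, "queries per KV head =", shape.queries_per_kv)
```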
---

## Key Finding: Tool-Use Corpus Density

Under mixed SFT, both Nano and Base score B4 = 0.000. This floor is a **corpus-density artifact**, not a capacity gate: LoRA fine-tuning Nano and Base on the denser 2,801-example tool-use corpus (ratio 1:21 instead of 1:211) lifts B4 to 0.145 ± 0.046 and 0.580, respectively.

| Model | Corpus | Ratio | B4 |
|---|---|---|---|
| Nano 42M (mixed SFT, N=4 seeds) | 62K examples | 1:211 | **0.000** |
| **Nano 42M + LoRA (N=4 seeds)** | **2,801 examples** | **1:21** | **0.145 ± 0.046** |
| Base 260M (mixed SFT) | 62K examples | 1:211 | **0.000** |
| **Base 260M + LoRA** | **2,801 examples** | **1:21** | **0.580** |
| Pro 3B + LoRA-64 | 62K examples | ~1:10 | 0.600 |
| Pro 7B + QLoRA-32 | 62K examples | ~1:10 | 0.880 |

### Nano LoRA Multi-Seed Results (N=4, Table 3 in paper)

| Seed | B1 KW | B2 F1 | B3 TM | **B4** | B5 |
|------|-------|-------|-------|--------|-----|
| 42 | 0.008 | 0.200 | 0.029 | **0.220** | 0.500 |
| 7 | 0.017 | 0.200 | 0.029 | **0.140** | 0.600 |
| 13 | 0.006 | 0.200 | 0.000 | **0.120** | 0.600 |
| 23 | 0.014 | 0.205 | 0.029 | **0.100** | 0.600 |
| **Mean ± std** | **0.011 ± 0.004** | **0.201 ± 0.002** | **0.021 ± 0.012** | **0.145 ± 0.046** | **0.575 ± 0.043** |

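The summary row can be re-derived from the per-seed values. The following sketch recomputes the B4 aggregate with Python's standard library. Recomputing suggests the reported ± is the population standard deviation (dividing by N) rather than the sample form (dividing by N-1); that is an inference from the numbers, not something the README states.

```python
import statistics

# Per-seed B4 scores copied from the table (seeds 42, 7, 13, 23).
b4_scores = [0.220, 0.140, 0.120, 0.100]

mean = statistics.fmean(b4_scores)
# pstdev is the population standard deviation (divides by N).
pop_std = statistics.pstdev(b4_scores)
# stdev is the sample standard deviation (divides by N-1), shown for contrast.
sample_std = statistics.stdev(b4_scores)

print(f"B4 mean = {mean:.3f}")
print(f"population std = {pop_std:.3f}, sample std = {sample_std:.3f}")
```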
---

## Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

### 2. Download checkpoints

```bash
mkdir -p checkpoints
# From HuggingFace (links TBD; see paper for GCS paths)
# Nano 42M post-SFT (503 MB)
# wget https://huggingface.co/vectrayx/nano-sft-v5/resolve/main/nano_sft_v5.pt \
#   -O checkpoints/nano_sft_v5.pt
# Base 260M post-Phase3 (3.1 GB)
# wget https://huggingface.co/vectrayx/base-phase3/resolve/main/base_phase3_last.pt \
#   -O checkpoints/base_phase3_last.pt
# Tokenizer (474 KB)
# wget https://huggingface.co/vectrayx/tokenizer/resolve/main/vectrayx_bpe.model \
#   -O checkpoints/vectrayx_bpe.model
```

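The three commented-out `wget` lines share one URL pattern. As a convenience sketch, the helper below rebuilds those URLs and pairs each with its destination under `checkpoints/`, without touching the network. The repo IDs and file names are copied from the commands above, which the README marks as TBD, so they may not resolve yet; `resolve_url` and `download_plan` are hypothetical names, not part of the release.

```python
from pathlib import Path

# Repo IDs and file names copied from the commented-out wget lines.
# The links are marked TBD, so these may not exist yet.
ASSETS = {
    "vectrayx/nano-sft-v5": "nano_sft_v5.pt",
    "vectrayx/base-phase3": "base_phase3_last.pt",
    "vectrayx/tokenizer": "vectrayx_bpe.model",
}


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL in the same form as the wget commands."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_plan(out_dir: str = "checkpoints") -> list[tuple[str, Path]]:
    """Return (url, destination) pairs; nothing is downloaded here."""
    return [(resolve_url(repo, name), Path(out_dir) / name) for repo, name in ASSETS.items()]


if __name__ == "__main__":
    for url, dest in download_plan():
        print(f"wget {url} -O {dest}")
```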
### 3. Run the full reproducibility suite

```bash
make repro
```

This runs:
1. `make bench-nano`: B1–B5 on Nano baseline (expected B4=0.000)
2. `make bench-base`: B1–B5 on Base baseline (expected B4=0.000)
3. `make lora-nano`: LoRA fine-tune Nano + eval (expected B4≈0.220 for seed=42)
4. `make lora-base`: LoRA fine-tune Base + eval (expected B4≈0.580 for seed=42)

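After a run, the observed B4 scores can be compared against the expected values listed above. This is a minimal sketch, not part of the release: `check_b4` and its tolerance are hypothetical, and the default of 0.02 is chosen only because it equals one prompt if B4 is scored on the 50-prompt v2 set.

```python
# Expected B4 scores per make target, copied from the list above.
EXPECTED_B4 = {
    "bench-nano": 0.000,
    "bench-base": 0.000,
    "lora-nano": 0.220,  # seed=42
    "lora-base": 0.580,  # seed=42
}


def check_b4(target: str, observed: float, tol: float = 0.02) -> bool:
    """Return True if `observed` is within `tol` of the expected B4 score.

    `tol` is an assumed tolerance, not a value taken from the paper.
    """
    # The small epsilon keeps exact-boundary cases from failing on float error.
    return abs(observed - EXPECTED_B4[target]) <= tol + 1e-12


print(check_b4("lora-nano", 0.220))  # within tolerance of the seed=42 value
print(check_b4("lora-nano", 0.100))  # far from the seed=42 value
```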
### 4. Run individual experiments

```bash
# Benchmark only (no training)
make bench-nano
make bench-base

# LoRA fine-tune + benchmark
make lora-nano # ~30 min on A10G
make lora-base # ~45 min on A10G

# Regenerate corpus
make corpus
```

---

## Reproducing the Pre-Training Pipeline

The full from-scratch pre-training pipeline (Phases 1–3 + SFT) is described in `training_v2/README.md` in the main repository. The key entry points are:

```bash
# 1. Train tokenizer (BPE-16384, 50/50 conv/tech balance)
python -m training.tokenizer.train_spm_bpe \
    --config configs/nano.json \
    --corpus-root /path/to/corpus \
    --out-dir checkpoints/tokenizer

# 2. Tokenize corpus → binary shards
python -m training.data.prepare_corpus \
    --tokenizer checkpoints/tokenizer/vectrayx_bpe.model \
    --corpus-root /path/to/corpus \
    --out-root data/bins

# 3. Pre-train (3 phases with replay buffer)
python training/pretrain.py --config configs/nano.json \
    --bins data/bins --out checkpoints --phase 1 \
    --batch-size 16 --grad-accum 8 --epochs 2
python training/pretrain.py --config configs/nano.json \
    --bins data/bins --out checkpoints --phase 2 \
    --resume checkpoints/phase1/last.pt
python training/pretrain.py --config configs/nano.json \
    --bins data/bins --out checkpoints --phase 3 \
    --resume checkpoints/phase2/last.pt

# 4. SFT with mini-curriculum
python training/finetune_sft.py \
    --config configs/nano.json \
    --tokenizer checkpoints/tokenizer/vectrayx_bpe.model \
    --resume checkpoints/phase3/last.pt \
    --out checkpoints/sft_v5 \
    --batch-size 16 --grad-accum 4 --epochs 3 --lr 2e-5

# 5. Benchmark
python eval/benchmark.py \
    --config configs/nano.json \
    --tokenizer checkpoints/tokenizer/vectrayx_bpe.model \
    --checkpoint checkpoints/sft_v5/final.pt \
    --data-dir eval_data \
    --out results/bench_nano_baseline.json
```

**Estimated cost:** ~$12 USD on GCP L4 for 3 full runs (v2/v4/v6 ablations).

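The `--batch-size` and `--grad-accum` flags together determine how many sequences feed each optimizer step. The sketch below works this out for the phase 1 and SFT commands. It assumes the usual meaning of the flags (micro-batch size and number of accumulated micro-batches) and a single GPU; the scripts themselves are the reference for how the flags are actually interpreted.

```python
def effective_batch_size(batch_size: int, grad_accum: int, n_devices: int = 1) -> int:
    """Sequences contributing to one optimizer step under gradient accumulation."""
    return batch_size * grad_accum * n_devices


# Flag values copied from the commands above; one device is an assumption.
pretrain_phase1 = effective_batch_size(batch_size=16, grad_accum=8)
sft = effective_batch_size(batch_size=16, grad_accum=4)

print(f"pre-training phase 1: {pretrain_phase1} sequences per optimizer step")
print(f"SFT: {sft} sequences per optimizer step")
```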
---

## SageMaker Experiments (LoRA)

The LoRA experiments were run on AWS SageMaker `ml.g5.xlarge` (NVIDIA A10G 24GB).

```bash
# Prerequisites: AWS CLI configured, S3 bucket with assets
# See training/aws_lora_nano_tools_s3.py for full setup

# Upload assets to S3
aws s3 cp checkpoints/nano_sft_v5.pt s3://YOUR_BUCKET/checkpoints/
aws s3 cp checkpoints/vectrayx_bpe.model s3://YOUR_BUCKET/tokenizers/
aws s3 cp corpus/tool_sft_mini_v1.jsonl s3://YOUR_BUCKET/training-data/

# Launch Nano LoRA (seed=42)
bash corpus/launch_nano_lora_mini_ondemand.sh

# Launch Base LoRA (seed=42)
bash corpus/launch_base_lora_mini_ondemand.sh
```

**Estimated cost per run:** ~$1.50 USD (ml.g5.xlarge on-demand, ~45 min).

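The three `aws s3 cp` uploads can also be scripted with `boto3`, which is already listed in the environment. This is a sketch under assumptions: `upload_plan` and `upload_all` are hypothetical helpers, the bucket name is supplied by the caller, and credentials come from the default AWS configuration. Only `upload_all` touches AWS; `upload_plan` can be inspected offline.

```python
from pathlib import Path

# (local path, S3 key prefix) pairs copied from the `aws s3 cp` commands.
UPLOADS = [
    ("checkpoints/nano_sft_v5.pt", "checkpoints/"),
    ("checkpoints/vectrayx_bpe.model", "tokenizers/"),
    ("corpus/tool_sft_mini_v1.jsonl", "training-data/"),
]


def upload_plan() -> list[tuple[str, str]]:
    """Return (local path, S3 key) pairs; each key keeps the original file name."""
    return [(local, prefix + Path(local).name) for local, prefix in UPLOADS]


def upload_all(bucket: str) -> None:
    """Upload every asset to `bucket` using the default AWS credentials."""
    import boto3  # imported lazily so the plan can be checked without AWS set up

    s3 = boto3.client("s3")
    for local, key in upload_plan():
        s3.upload_file(local, bucket, key)  # arguments: Filename, Bucket, Key
        print(f"uploaded {local} -> s3://{bucket}/{key}")
```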
---

## Model Checkpoints

| Checkpoint | Size | Description | Link |
|---|---|---|---|
| `nano_sft_v5.pt` | 503 MB | Nano 42M post-SFT (base for LoRA) | HuggingFace (TBD) |
| `nano_lora_mini_s42.pt` | ~5 MB | Nano LoRA adapter (seed=42) | HuggingFace (TBD) |
| `base_phase3_last.pt` | 3.1 GB | Base 260M post-Phase3 (base for LoRA) | HuggingFace (TBD) |
| `base_lora_mini_s42.pt` | ~20 MB | Base LoRA adapter (seed=42) | HuggingFace (TBD) |
| `vectrayx_bpe.model` | 474 KB | BPE-16384 tokenizer | HuggingFace (TBD) |

---

## Environment

Experiments were run with:

| Component | Version |
|---|---|
| Python | 3.10 |
| PyTorch | 2.11.0 |
| sentencepiece | 0.2.1 |
| numpy | 2.4.2 |
| CUDA | 12.1 |
| boto3 | 1.42.93 |
| sagemaker | 3.10.0 |

Hardware:
- Pre-training: GCP `g2-standard-4` (NVIDIA L4 24GB), `us-west1-a`
- LoRA experiments: AWS SageMaker `ml.g5.xlarge` (NVIDIA A10G 24GB), `us-east-1`
- Multi-seed runs: AWS EC2 `g4dn.xlarge` (NVIDIA T4 16GB)

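To check a local environment against the versions listed above, something like the following can help. It is a sketch, not part of the release: the distribution names are assumptions (PyTorch is assumed to be installed as `torch`), `version_mismatches` is a hypothetical helper, and Python and CUDA are left out because they are not pip distributions.

```python
from importlib import metadata
from typing import Optional

# Versions copied from the table above; keys are assumed PyPI distribution names.
EXPECTED = {
    "torch": "2.11.0",
    "sentencepiece": "0.2.1",
    "numpy": "2.4.2",
    "boto3": "1.42.93",
    "sagemaker": "3.10.0",
}


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of `dist`, or None if it is not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(expected=EXPECTED, lookup=installed_version) -> dict:
    """Map each package whose installed version differs to (expected, found)."""
    mismatches = {}
    for dist, want in expected.items():
        found = lookup(dist)
        # Ignore local build tags such as "+cu121" when comparing.
        if found is None or found.split("+")[0] != want:
            mismatches[dist] = (want, found)
    return mismatches


if __name__ == "__main__":
    for dist, (want, found) in version_mismatches().items():
        print(f"{dist}: expected {want}, found {found}")
```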
---

## Citation

```bibtex
@misc{santillana2026vectrayx,
  title  = {VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model
            with Curriculum Learning and Native Tool Use},
  author = {Santillana, Juan S.},
  note   = {Preprint},
  year   = {2026}
}
```

---

## License

| Component | License |
|---|---|
| Training code | MIT |
| Evaluation datasets (B1–B5) | CC-BY-4.0 |
| Model weights | Apache 2.0 |
| Paper | CC-BY-4.0 |