Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,262 @@
# VectraYX: Reproducibility Release

**Paper:** *VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use*

This repository contains the code, datasets, and pre-computed results needed to reproduce the key experiments from the paper.

---

## Repository Structure

```
release/
├── Makefile ← make repro / make bench-nano / make lora-nano
├── requirements.txt ← exact package versions
├── configs/
│ ├── nano.json ← Nano 42M architecture (GQA 8q/2kv, d_model=512)
│ └── base.json ← Base 260M architecture (GQA 16q/4kv, d_model=1024)
├── training/
│ ├── transformer.py ← VectraYXNano model (GQA + QK-Norm + Z-loss + RoPE)
│ ├── pretrain.py ← 3-phase curriculum pre-training driver
│ ├── finetune_sft.py ← SFT with assistant-only loss masking + mini-curriculum
│ ├── finetune_lora_tools.py ← LoRA adapter injection + merge (key experiment)
│ ├── finetune_tools.py ← Full fine-tune (baseline comparison)
│ ├── sft_dataset.py ← JSONL → tokenized dataset with loss masking
│ ├── utils.py ← AdamW, cosine LR, checkpoint save/load
│ ├── aws_lora_nano_tools_s3.py ← SageMaker launcher: Nano LoRA (S3-only)
│ └── aws_lora_base_tools_s3.py ← SageMaker launcher: Base LoRA (S3-only)
├── eval/
│ ├── benchmark.py ← VectraYX-Bench B1–B5 harness
│ ├── run_inference_lora.py ← Inference with LoRA adapter loaded
│ ├── run_inference_base.py ← Inference with base checkpoint
│ └── red_team_eval.py ← Adversarial probe script
├── eval_data/
│ ├── b1_cveqa.jsonl ← 500 CVE Q&A prompts + expected keywords
│ ├── b2_classification.jsonl ← 200 threat classification examples
│ ├── b3_commands.jsonl ← 35 command-line completion prompts
│ ├── b4_tooluse.jsonl ← 25 tool-selection prompts (v2: 50 prompts)
│ └── b5_conversational.jsonl ← 10 conversational gate prompts
├── corpus/
│ ├── tool_sft_mini_v1.jsonl ← 2,801 tool-use examples (ratio 1:21) ← KEY
│ ├── tool_sft_v3_bash.jsonl ← 296 bash-focused examples
│ ├── tool_sft_v2_simple.jsonl ← 115 simple bash examples
│ ├── b4_tooluse_v2.jsonl ← B4 benchmark v2 (50 questions, 60% bash)
│ ├── build_mini_tool_corpus.py ← Regenerate tool_sft_mini_v1 from scratch
│ ├── build_tool_sft_corpus.py ← Full tool-use corpus generator
│ └── build_v3_and_bench.py ← v3 corpus + benchmark builder
├── results/
│ ├── bench_nano_baseline_multiseed.json ← Nano baseline N=4 seeds (paper Table 2)
│ ├── bench_nano_lora_multiseed.json ← Nano LoRA N=4 seeds (paper Table 3)
│ └── bench_base_lora_s42.json ← Base LoRA seed=42 (paper Table 3)
└── paper/
    └── main.pdf ← Paper PDF
```
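
The architecture notes next to `configs/nano.json` and `configs/base.json` imply a few derived attention shapes. The sketch below works them out under the common convention that the per-head dimension equals `d_model` divided by the number of query heads. That convention is an assumption here, and `GQAShape` with its field names is a hypothetical helper, not the schema of the JSON files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GQAShape:
    """Hypothetical helper; field names are illustrative, not the config schema."""
    d_model: int
    n_q_heads: int
    n_kv_heads: int

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = d_model / n_q_heads (the usual convention).
        assert self.d_model % self.n_q_heads == 0
        return self.d_model // self.n_q_heads

    @property
    def queries_per_kv(self) -> int:
        # Number of query heads that share one key/value head.
        assert self.n_q_heads % self.n_kv_heads == 0
        return self.n_q_heads // self.n_kv_heads

    @property
    def kv_proj_width(self) -> int:
        # Output width of each of the K and V projections.
        return self.n_kv_heads * self.head_dim


# Values taken from the comments in the tree above.
nano = GQAShape(d_model=512, n_q_heads=8, n_kv_heads=2)
base = GQAShape(d_model=1024, n_q_heads=16, n_kv_heads=4)

for name, shape in (("nano", nano), ("base", base)):
    print(name, shape.head_dim, shape.queries_per_kv, shape.kv_proj_width)
```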

---

## Key Finding: Tool-Use Corpus Density

The B4=0.000 floor in mixed SFT is a **corpus-density artifact**, not a capacity gate.

| Model | Corpus | Ratio | B4 |
|---|---|---|---|
| Nano 42M (mixed SFT, N=4 seeds) | 62K examples | 1:211 | **0.000** |
| **Nano 42M + LoRA (N=4 seeds)** | **2,801 examples** | **1:21** | **0.145 ± 0.046** |
| Base 260M (mixed SFT) | 62K examples | 1:211 | **0.000** |
| **Base 260M + LoRA** | **2,801 examples** | **1:21** | **0.580** |
| Pro 3B + LoRA-64 | 62K examples | ~1:10 | 0.600 |
| Pro 7B + QLoRA-32 | 62K examples | ~1:10 | 0.880 |

### Nano LoRA Multi-Seed Results (N=4, Table 3 in paper)

| Seed | B1 KW | B2 F1 | B3 TM | **B4** | B5 |
|------|-------|-------|-------|--------|-----|
| 42 | 0.008 | 0.200 | 0.029 | **0.220** | 0.500 |
| 7 | 0.017 | 0.200 | 0.029 | **0.140** | 0.600 |
| 13 | 0.006 | 0.200 | 0.000 | **0.120** | 0.600 |
| 23 | 0.014 | 0.205 | 0.029 | **0.100** | 0.600 |
| **Mean ± std** | **0.011 ± 0.004** | **0.201 ± 0.002** | **0.021 ± 0.012** | **0.145 ± 0.046** | **0.575 ± 0.043** |

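
The summary row can be recomputed from the per-seed rows. The reported spreads match the population standard deviation (dividing by N=4) rather than the sample one; that is an inference from the numbers, not something stated here. B3 is left out of the sketch because its per-seed entries are rounded too coarsely (0.029) to reproduce the third decimal of the mean.

```python
from statistics import mean, pstdev

# Per-seed scores copied from the table above (seeds 42, 7, 13, 23).
scores = {
    "B1": [0.008, 0.017, 0.006, 0.014],
    "B2": [0.200, 0.200, 0.200, 0.205],
    "B4": [0.220, 0.140, 0.120, 0.100],
    "B5": [0.500, 0.600, 0.600, 0.600],
}


def summarize(values):
    """Return (mean, population std), both rounded to three decimals."""
    return round(mean(values), 3), round(pstdev(values), 3)


summary = {name: summarize(values) for name, values in scores.items()}
for name, (m, s) in summary.items():
    print(f"{name}: {m:.3f} ± {s:.3f}")
```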

---

## Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

### 2. Download checkpoints

```bash
mkdir -p checkpoints
# From HuggingFace (links TBD; see paper for GCS paths)
# Nano 42M post-SFT (503 MB)
# wget https://huggingface.co/vectrayx/nano-sft-v5/resolve/main/nano_sft_v5.pt \
#   -O checkpoints/nano_sft_v5.pt
# Base 260M post-Phase3 (3.1 GB)
# wget https://huggingface.co/vectrayx/base-phase3/resolve/main/base_phase3_last.pt \
#   -O checkpoints/base_phase3_last.pt
# Tokenizer (474 KB)
# wget https://huggingface.co/vectrayx/tokenizer/resolve/main/vectrayx_bpe.model \
#   -O checkpoints/vectrayx_bpe.model
```

### 3. Run the full reproducibility suite

```bash
make repro
```

This runs:
1. `make bench-nano`: B1–B5 on Nano baseline (expected B4=0.000)
2. `make bench-base`: B1–B5 on Base baseline (expected B4=0.000)
3. `make lora-nano`: LoRA fine-tune Nano + eval (expected B4≈0.220 for seed=42)
4. `make lora-base`: LoRA fine-tune Base + eval (expected B4≈0.580 for seed=42)
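
The single-run B4 scores in this README are all multiples of 0.02, which is consistent with accuracy over the 50-question `b4_tooluse_v2.jsonl` set. Assuming that reading (the scoring rule is not spelled out here), the sketch below converts the expected scores into correct-answer counts and checks whether a rerun lands close enough. The function names and the one-question default tolerance are illustrative choices, not part of the release.

```python
N_QUESTIONS = 50  # assumption: B4 is scored on the 50-question v2 set


def correct_count(score: float, n: int = N_QUESTIONS) -> int:
    """Convert an accuracy-style score into a number of correct answers."""
    return round(score * n)


def matches_expected(observed: float, expected: float, tol_questions: int = 1) -> bool:
    """True if the rerun differs from the expected score by at most tol_questions answers."""
    return abs(correct_count(observed) - correct_count(expected)) <= tol_questions


# Expected seed=42 values from the list above.
expected_b4 = {
    "bench-nano": 0.000,
    "bench-base": 0.000,
    "lora-nano": 0.220,
    "lora-base": 0.580,
}

for target, score in expected_b4.items():
    print(f"{target}: {correct_count(score)}/{N_QUESTIONS}")
```

Under this assumption a rerun of `make lora-nano` that scores 0.200 (10 of 50) would still count as matching the expected 0.220 (11 of 50), while 0.100 would not.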

### 4. Run individual experiments

```bash
# Benchmark only (no training)
make bench-nano
make bench-base

# LoRA fine-tune + benchmark
make lora-nano  # ~30 min on A10G
make lora-base  # ~45 min on A10G

# Regenerate corpus
make corpus
```

---

## Reproducing the Pre-Training Pipeline

The full from-scratch pre-training pipeline (Phases 1–3 + SFT) is described in `training_v2/README.md` in the main repository. The key entry points are:

```bash
# 1. Train tokenizer (BPE-16384, 50/50 conv/tech balance)
python -m training.tokenizer.train_spm_bpe \
    --config configs/nano.json \
    --corpus-root /path/to/corpus \
    --out-dir checkpoints/tokenizer

# 2. Tokenize corpus → binary shards
python -m training.data.prepare_corpus \
    --tokenizer checkpoints/tokenizer/vectrayx_bpe.model \
    --corpus-root /path/to/corpus \
    --out-root data/bins

# 3. Pre-train (3 phases with replay buffer)
python training/pretrain.py --config configs/nano.json \
    --bins data/bins --out checkpoints --phase 1 \
    --batch-size 16 --grad-accum 8 --epochs 2
python training/pretrain.py --config configs/nano.json \
    --bins data/bins --out checkpoints --phase 2 \
    --resume checkpoints/phase1/last.pt
python training/pretrain.py --config configs/nano.json \
    --bins data/bins --out checkpoints --phase 3 \
    --resume checkpoints/phase2/last.pt

# 4. SFT with mini-curriculum
python training/finetune_sft.py \
    --config configs/nano.json \
    --tokenizer checkpoints/tokenizer/vectrayx_bpe.model \
    --resume checkpoints/phase3/last.pt \
    --out checkpoints/sft_v5 \
    --batch-size 16 --grad-accum 4 --epochs 3 --lr 2e-5

# 5. Benchmark
python eval/benchmark.py \
    --config configs/nano.json \
    --tokenizer checkpoints/tokenizer/vectrayx_bpe.model \
    --checkpoint checkpoints/sft_v5/final.pt \
    --data-dir eval_data \
    --out results/bench_nano_baseline.json
```
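
The `--batch-size` and `--grad-accum` flags above combine into the number of sequences behind each optimizer step. The sketch below shows that arithmetic, assuming a single GPU and that `--batch-size` counts sequences per micro-batch; neither is stated explicitly here, and `effective_batch` is a hypothetical helper rather than a function from `training/utils.py`.

```python
def effective_batch(batch_size: int, grad_accum: int, n_devices: int = 1) -> int:
    """Sequences contributing to one optimizer step."""
    return batch_size * grad_accum * n_devices


# Flag values copied from the commands above.
pretrain_phase1 = effective_batch(batch_size=16, grad_accum=8)
sft = effective_batch(batch_size=16, grad_accum=4)

print(f"Phase 1 pre-training: {pretrain_phase1} sequences per step")
print(f"SFT: {sft} sequences per step")
```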

**Estimated cost:** ~$12 USD on GCP L4 for 3 full runs (v2/v4/v6 ablations).

---

## SageMaker Experiments (LoRA)

The LoRA experiments were run on AWS SageMaker `ml.g5.xlarge` (NVIDIA A10G 24GB).

```bash
# Prerequisites: AWS CLI configured, S3 bucket with assets
# See training/aws_lora_nano_tools_s3.py for full setup

# Upload assets to S3
aws s3 cp checkpoints/nano_sft_v5.pt s3://YOUR_BUCKET/checkpoints/
aws s3 cp checkpoints/vectrayx_bpe.model s3://YOUR_BUCKET/tokenizers/
aws s3 cp corpus/tool_sft_mini_v1.jsonl s3://YOUR_BUCKET/training-data/

# Launch Nano LoRA (seed=42)
bash corpus/launch_nano_lora_mini_ondemand.sh

# Launch Base LoRA (seed=42)
bash corpus/launch_base_lora_mini_ondemand.sh
```

**Estimated cost per run:** ~$1.50 USD (ml.g5.xlarge on-demand, ~45 min).

---

## Model Checkpoints

| Checkpoint | Size | Description | Link |
|---|---|---|---|
| `nano_sft_v5.pt` | 503 MB | Nano 42M post-SFT (base for LoRA) | HuggingFace (TBD) |
| `nano_lora_mini_s42.pt` | ~5 MB | Nano LoRA adapter (seed=42) | HuggingFace (TBD) |
| `base_phase3_last.pt` | 3.1 GB | Base 260M post-Phase3 (base for LoRA) | HuggingFace (TBD) |
| `base_lora_mini_s42.pt` | ~20 MB | Base LoRA adapter (seed=42) | HuggingFace (TBD) |
| `vectrayx_bpe.model` | 474 KB | BPE-16384 tokenizer | HuggingFace (TBD) |

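
The adapter files are far smaller than their base checkpoints because LoRA stores only two low-rank factors per adapted weight matrix. The sketch below shows the standard LoRA merge, W' = W + (alpha / r) · B A, on toy matrices in plain Python. The rank, alpha, and matrix sizes are made up for illustration; the values and merge code actually used are in `training/finetune_lora_tools.py` and may differ from this sketch.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the standard LoRA merge."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape (d_out, d_in), same as W
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_param_count(d_in, d_out, rank):
    """Parameters stored by the adapter for one weight matrix."""
    return rank * (d_in + d_out)


# Toy example: d_out = 2, d_in = 3, rank = 1 (illustrative values only).
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]   # shape (rank, d_in)
lora_b = [[0.5], [0.0]]      # shape (d_out, rank)

merged = merge_lora(w, lora_a, lora_b, alpha=2.0, rank=1)
print(merged)
print(lora_param_count(d_in=3, d_out=2, rank=1), "adapter params vs", 3 * 2, "full params")
```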

---

## Environment

Experiments were run with:

| Package | Version |
|---|---|
| Python | 3.10 |
| PyTorch | 2.11.0 |
| sentencepiece | 0.2.1 |
| numpy | 2.4.2 |
| CUDA | 12.1 |
| boto3 | 1.42.93 |
| sagemaker | 3.10.0 |

Hardware:
- Pre-training: GCP `g2-standard-4` (NVIDIA L4 24GB), `us-west1-a`
- LoRA experiments: AWS SageMaker `ml.g5.xlarge` (NVIDIA A10G 24GB), `us-east-1`
- Multi-seed runs: AWS EC2 `g4dn.xlarge` (NVIDIA T4 16GB)

---

## Citation

```bibtex
@inproceedings{santillana2026vectrayx,
  title = {VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model
           with Curriculum Learning and Native Tool Use},
  author = {Santillana, Juan S.},
  booktitle = {Preprint},
  year = {2026}
}
```

---

## License

| Component | License |
|---|---|
| Training code | MIT |
| Evaluation datasets (B1–B5) | CC-BY-4.0 |
| Model weights | Apache 2.0 |
| Paper | CC-BY-4.0 |