	# VectraYX — Reproducibility Release

	Paper: VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

	This repository contains the code, datasets, and pre-computed results needed to reproduce the key experiments from the paper.
	- Author website: https://jsantillana.com

	---

	## Repository Structure

	```
	release/
	├── Makefile ← make repro / make bench-nano / make lora-nano
	├── requirements.txt ← exact package versions
	├── configs/
	│ ├── nano.json ← Nano 42M architecture (GQA 8q/2kv, d_model=512)
	│ └── base.json ← Base 260M architecture (GQA 16q/4kv, d_model=1024)
	├── training/
	│ ├── transformer.py ← VectraYXNano model (GQA + QK-Norm + Z-loss + RoPE)
	│ ├── pretrain.py ← 3-phase curriculum pre-training driver
	│ ├── finetune_sft.py ← SFT with assistant-only loss masking + mini-curriculum
	│ ├── finetune_lora_tools.py ← LoRA adapter injection + merge (key experiment)
	│ ├── finetune_tools.py ← Full fine-tune (baseline comparison)
	│ ├── sft_dataset.py ← JSONL → tokenized dataset with loss masking
	│ ├── utils.py ← AdamW, cosine LR, checkpoint save/load
	│ ├── aws_lora_nano_tools_s3.py ← SageMaker launcher: Nano LoRA (S3-only)
	│ └── aws_lora_base_tools_s3.py ← SageMaker launcher: Base LoRA (S3-only)
	├── eval/
	│ ├── benchmark.py ← VectraYX-Bench B1–B5 harness
	│ ├── run_inference_lora.py ← Inference with LoRA adapter loaded
	│ ├── run_inference_base.py ← Inference with base checkpoint
	│ └── red_team_eval.py ← Adversarial probe script
	├── eval_data/
	│ ├── b1_cveqa.jsonl ← 500 CVE Q&A prompts + expected keywords
	│ ├── b2_classification.jsonl ← 200 threat classification examples
	│ ├── b3_commands.jsonl ← 35 command-line completion prompts
	│ ├── b4_tooluse.jsonl ← 25 tool-selection prompts (v2: 50 prompts)
	│ └── b5_conversational.jsonl ← 10 conversational gate prompts
	├── corpus/
	│ ├── tool_sft_mini_v1.jsonl ← 2,801 tool-use examples (ratio 1:21) ← KEY
	│ ├── tool_sft_v3_bash.jsonl ← 296 bash-focused examples
	│ ├── tool_sft_v2_simple.jsonl ← 115 simple bash examples
	│ ├── b4_tooluse_v2.jsonl ← B4 benchmark v2 (50 questions, 60% bash)
	│ ├── build_mini_tool_corpus.py ← Regenerate tool_sft_mini_v1 from scratch
	│ ├── build_tool_sft_corpus.py ← Full tool-use corpus generator
	│ └── build_v3_and_bench.py ← v3 corpus + benchmark builder
	├── results/
	│ ├── bench_nano_baseline_multiseed.json ← Nano baseline N=4 seeds (paper Table 2)
	│ ├── bench_nano_lora_multiseed.json ← Nano LoRA N=4 seeds (paper Table 3)
	│ └── bench_base_lora_s42.json ← Base LoRA seed=42 (paper Table 3)
	└── paper/
	  └── main.pdf ← Paper PDF
	```
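
The GQA settings listed for `configs/nano.json` and `configs/base.json` imply a head size and a sharing factor that are easy to check by hand. The sketch below works through that arithmetic. It assumes `d_model` is split evenly across query heads and that query heads are grouped contiguously onto key/value heads; the class and its field names are hypothetical and are not read from the actual config files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GQAShape:
    """Hypothetical container for the attention shape quoted in the tree."""
    d_model: int
    n_q_heads: int
    n_kv_heads: int

    @property
    def head_dim(self) -> int:
        # Assumption: d_model is divided evenly across the query heads.
        assert self.d_model % self.n_q_heads == 0
        return self.d_model // self.n_q_heads

    @property
    def group_size(self) -> int:
        # Number of query heads that share one key/value head.
        assert self.n_q_heads % self.n_kv_heads == 0
        return self.n_q_heads // self.n_kv_heads

    def kv_head_for(self, q_head: int) -> int:
        # Assumption: contiguous grouping, so the first `group_size`
        # query heads map to KV head 0, the next group to KV head 1, etc.
        return q_head // self.group_size


nano = GQAShape(d_model=512, n_q_heads=8, n_kv_heads=2)
base = GQAShape(d_model=1024, n_q_heads=16, n_kv_heads=4)

for name, shape in (("nano", nano), ("base", base)):
    print(name, "head_dim:", shape.head_dim, "group_size:", shape.group_size)
```

Under these assumptions both configurations use 64-dimensional heads with four query heads per key/value head, so the KV cache is a quarter of what full multi-head attention would need at the same head count.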

	---

	## Key Finding: Tool-Use Corpus Density

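	The B4 = 0.000 floor under mixed SFT is a corpus-density artifact, not a capacity gate: Nano and Base both score 0.000 when trained on the 1:211 corpus, and both score above zero after LoRA fine-tuning on the 1:21 corpus.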

	| Model | Corpus | Ratio | B4 |
	|---|---|---|---|
	| Nano 42M (mixed SFT, N=4 seeds) | 62K examples | 1:211 | 0.000 |
	| Nano 42M + LoRA (N=4 seeds) | 2,801 examples | 1:21 | 0.145 ± 0.046 |
	| Base 260M (mixed SFT) | 62K examples | 1:211 | 0.000 |
	| Base 260M + LoRA | 2,801 examples | 1:21 | 0.580 |
	| Pro 3B + LoRA-64 | 62K examples | ~1:10 | 0.600 |
	| Pro 7B + QLoRA-32 | 62K examples | ~1:10 | 0.880 |

	### Nano LoRA Multi-Seed Results (N=4, Table 3 in paper)

	| Seed | B1 KW | B2 F1 | B3 TM | B4 | B5 |
	|------|-------|-------|-------|--------|-----|
	| 42 | 0.008 | 0.200 | 0.029 | 0.220 | 0.500 |
	| 7 | 0.017 | 0.200 | 0.029 | 0.140 | 0.600 |
	| 13 | 0.006 | 0.200 | 0.000 | 0.120 | 0.600 |
	| 23 | 0.014 | 0.205 | 0.029 | 0.100 | 0.600 |
	| Mean ± std | 0.011 ± 0.004 | 0.201 ± 0.002 | 0.021 ± 0.012 | 0.145 ± 0.046 | 0.575 ± 0.043 |
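
The summary row can be recomputed from the per-seed rows. The README does not say which estimator was used, but the reported B4 and B5 values match the population standard deviation (divide by N), not the sample standard deviation (divide by N-1). A minimal check using only the standard library, with the scores copied from the table:

```python
import statistics

# B4 and B5 scores per seed, copied from the table (seeds 42, 7, 13, 23).
b4 = [0.220, 0.140, 0.120, 0.100]
b5 = [0.500, 0.600, 0.600, 0.600]


def mean_pstd(values):
    """Return (mean, population standard deviation) of a list of scores."""
    return statistics.fmean(values), statistics.pstdev(values)


for name, scores in (("B4", b4), ("B5", b5)):
    mean, std = mean_pstd(scores)
    print(f"{name}: {mean:.3f} ± {std:.3f}")
```

With `statistics.stdev` (the sample estimator) the B4 spread would come out near 0.053 instead of 0.046, so comparisons against other multi-seed results should use the same estimator.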

	---

	## Quick Start

	### 1. Install dependencies

	```bash
	pip install -r requirements.txt
	```

	### 2. Download checkpoints

	```bash
	mkdir -p checkpoints
	# From HuggingFace (links TBD — see paper for GCS paths)
	# Nano 42M post-SFT (503 MB)
	# wget https://huggingface.co/vectrayx/nano-sft-v5/resolve/main/nano_sft_v5.pt \
	# -O checkpoints/nano_sft_v5.pt
	# Base 260M post-Phase3 (3.1 GB)
	# wget https://huggingface.co/vectrayx/base-phase3/resolve/main/base_phase3_last.pt \
	# -O checkpoints/base_phase3_last.pt
	# Tokenizer (474 KB)
	# wget https://huggingface.co/vectrayx/tokenizer/resolve/main/vectrayx_bpe.model \
	# -O checkpoints/vectrayx_bpe.model
	```
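
Because the download links are still marked TBD, it can help to confirm that the three files are in place before running any `make` target. The helper below is a hypothetical convenience script, not part of the release; the file names come from the commands above and the `checkpoints` directory from the `mkdir` line.

```python
from pathlib import Path

# File names taken from the download commands in this section.
EXPECTED_FILES = ("nano_sft_v5.pt", "base_phase3_last.pt", "vectrayx_bpe.model")


def missing_checkpoints(root="checkpoints"):
    """Return the expected files that are absent or empty under `root`."""
    root = Path(root)
    return [
        name
        for name in EXPECTED_FILES
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


# Example: report anything still missing from ./checkpoints.
missing = missing_checkpoints()
print("Missing:", ", ".join(missing) if missing else "none")
```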

	### 3. Run the full reproducibility suite

	```bash
	make repro
	```

	This runs:
	1. `make bench-nano` — B1–B5 on Nano baseline (expected B4=0.000)
	2. `make bench-base` — B1–B5 on Base baseline (expected B4=0.000)
	3. `make lora-nano` — LoRA fine-tune Nano + eval (expected B4≈0.220 for seed=42)
	4. `make lora-base` — LoRA fine-tune Base + eval (expected B4≈0.580 for seed=42)
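
After a run, the four expected B4 values listed above can be compared against your own scores. The sketch below assumes you have already extracted one B4 score per target by hand; it does not parse the files in `results/`, whose schema is not documented here. The function name and the default tolerance are illustrative choices, not values from the paper.

```python
# Expected B4 scores for seed=42, taken from the list above.
EXPECTED_B4 = {
    "bench-nano": 0.000,
    "bench-base": 0.000,
    "lora-nano": 0.220,
    "lora-base": 0.580,
}


def compare_b4(observed, tolerance=0.02):
    """Return {target: (observed, expected)} for targets outside the tolerance.

    `observed` maps a make target name to the B4 score you measured.
    Targets that were not run are reported with an observed value of None.
    """
    mismatches = {}
    for target, expected in EXPECTED_B4.items():
        value = observed.get(target)
        if value is None or abs(value - expected) > tolerance:
            mismatches[target] = (value, expected)
    return mismatches


# Hypothetical scores from a local run.
print(compare_b4({"bench-nano": 0.0, "bench-base": 0.0,
                  "lora-nano": 0.22, "lora-base": 0.50}))
```

An empty dictionary means every target landed within the tolerance of the value quoted in this README.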

	### 4. Run individual experiments

	```bash
	# Benchmark only (no training)
	make bench-nano
	make bench-base

	# LoRA fine-tune + benchmark
	make lora-nano # ~30 min on A10G
	make lora-base # ~45 min on A10G

	# Regenerate corpus
	make corpus
	```

	---

	## Reproducing the Pre-Training Pipeline

	The full from-scratch pre-training pipeline (Phases 1–3 + SFT) is described in `training_v2/README.md` in the main repository. The key entry points are:

	```bash
	# 1. Train tokenizer (BPE-16384, 50/50 conv/tech balance)
	python -m training.tokenizer.train_spm_bpe \
	    --config configs/nano.json \
	    --corpus-root /path/to/corpus \
	    --out-dir checkpoints/tokenizer

	# 2. Tokenize corpus → binary shards
	python -m training.data.prepare_corpus \
	    --tokenizer checkpoints/tokenizer/vectrayx_bpe.model \
	    --corpus-root /path/to/corpus \
	    --out-root data/bins

	# 3. Pre-train (3 phases with replay buffer)
	python training/pretrain.py --config configs/nano.json \
	    --bins data/bins --out checkpoints --phase 1 \
	    --batch-size 16 --grad-accum 8 --epochs 2
	python training/pretrain.py --config configs/nano.json \
	    --bins data/bins --out checkpoints --phase 2 \
	    --resume checkpoints/phase1/last.pt
	python training/pretrain.py --config configs/nano.json \
	    --bins data/bins --out checkpoints --phase 3 \
	    --resume checkpoints/phase2/last.pt

	# 4. SFT with mini-curriculum
	python training/finetune_sft.py \
	    --config configs/nano.json \
	    --tokenizer checkpoints/tokenizer/vectrayx_bpe.model \
	    --resume checkpoints/phase3/last.pt \
	    --out checkpoints/sft_v5 \
	    --batch-size 16 --grad-accum 4 --epochs 3 --lr 2e-5

	# 5. Benchmark
	python eval/benchmark.py \
	    --config configs/nano.json \
	    --tokenizer checkpoints/tokenizer/vectrayx_bpe.model \
	    --checkpoint checkpoints/sft_v5/final.pt \
	    --data-dir eval_data \
	    --out results/bench_nano_baseline.json
	```

	Estimated cost: ~$12 USD on GCP L4 for 3 full runs (v2/v4/v6 ablations).
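
Step 4 runs `finetune_sft.py`, which the repository structure describes as SFT with assistant-only loss masking. As a rough illustration of that idea, the sketch below builds labels in which only assistant tokens contribute to the loss. The segment format, role names, and token ids are hypothetical and may not match what `sft_dataset.py` does. The value -100 is the default `ignore_index` of `torch.nn.CrossEntropyLoss`; whether the release uses that sentinel is an assumption.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(segments):
    """Flatten (role, token_ids) segments into input ids and masked labels.

    Tokens from non-assistant roles get IGNORE_INDEX so they add nothing to
    the loss; assistant tokens keep their own id as the label. The one-token
    shift for next-token prediction is left to the training loop.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Hypothetical token ids for a one-turn conversation.
example = [("user", [11, 12, 13]), ("assistant", [21, 22])]
input_ids, labels = build_labels(example)
print(input_ids)
print(labels)
```

Masking the prompt this way keeps the model from being trained to reproduce user text, so the gradient comes only from the responses it is supposed to generate.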

	---

	## SageMaker Experiments (LoRA)

	The LoRA experiments were run on AWS SageMaker `ml.g5.xlarge` (NVIDIA A10G 24GB).

	```bash
	# Prerequisites: AWS CLI configured, S3 bucket with assets
	# See training/aws_lora_nano_tools_s3.py for full setup

	# Upload assets to S3
	aws s3 cp checkpoints/nano_sft_v5.pt s3://YOUR_BUCKET/checkpoints/
	aws s3 cp checkpoints/vectrayx_bpe.model s3://YOUR_BUCKET/tokenizers/
	aws s3 cp corpus/tool_sft_mini_v1.jsonl s3://YOUR_BUCKET/training-data/

	# Launch Nano LoRA (seed=42)
	bash corpus/launch_nano_lora_mini_ondemand.sh

	# Launch Base LoRA (seed=42)
	bash corpus/launch_base_lora_mini_ondemand.sh
	```

	Estimated cost per run: ~$1.50 USD (ml.g5.xlarge on-demand, ~45 min).
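
The launchers above run `finetune_lora_tools.py`, which the repository structure describes as LoRA adapter injection plus merge. For readers unfamiliar with the merge step, the sketch below shows the standard LoRA identity W' = W + (alpha / r) · B · A on a toy matrix in plain Python. It illustrates the general technique only: the rank, alpha, and matrices are made up, and the release's own implementation may differ in layout and scaling.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A.

    Shapes: weight is out x in, lora_b is out x rank, lora_a is rank x in.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 weight with a rank-1 adapter; all numbers are illustrative.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # out x rank
A = [[0.5, -0.5]]    # rank x in
merged = merge_lora(W, A, B, alpha=2.0, rank=1)
print(merged)
```

After merging, the adapter adds no inference cost, because the low-rank update has been folded into the original weight matrix.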

	---

	## Model Checkpoints

	| Checkpoint | Size | Description | Link |
	|---|---|---|---|
	| `nano_sft_v5.pt` | 503 MB | Nano 42M post-SFT (base for LoRA) | HuggingFace (TBD) |
	| `nano_lora_mini_s42.pt` | ~5 MB | Nano LoRA adapter (seed=42) | HuggingFace (TBD) |
	| `base_phase3_last.pt` | 3.1 GB | Base 260M post-Phase3 (base for LoRA) | HuggingFace (TBD) |
	| `base_lora_mini_s42.pt` | ~20 MB | Base LoRA adapter (seed=42) | HuggingFace (TBD) |
	| `vectrayx_bpe.model` | 474 KB | BPE-16384 tokenizer | HuggingFace (TBD) |

	---

	## Environment

	Experiments were run with:

	| Package | Version |
	|---|---|
	| Python | 3.10 |
	| PyTorch | 2.11.0 |
	| sentencepiece | 0.2.1 |
	| numpy | 2.4.2 |
	| CUDA | 12.1 |
	| boto3 | 1.42.93 |
	| sagemaker | 3.10.0 |

	Hardware:
	- Pre-training: GCP `g2-standard-4` (NVIDIA L4 24GB), `us-west1-a`
	- LoRA experiments: AWS SageMaker `ml.g5.xlarge` (NVIDIA A10G 24GB), `us-east-1`
	- Multi-seed runs: AWS EC2 `g4dn.xlarge` (NVIDIA T4 16GB)

	---

	## Citation

	```bibtex
	@inproceedings{santillana2026vectrayx,
	title = {VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model
	with Curriculum Learning and Native Tool Use},
	author = {Santillana, Juan S.},
	booktitle = {Preprint},
	year = {2026}
	}
	```

	---

	## License

	| Component | License |
	|---|---|
	| Training code | MIT |
	| Evaluation datasets (B1–B5) | CC-BY-4.0 |
	| Model weights | Apache 2.0 |
	| Paper | CC-BY-4.0 |