VectraYX-Nano

VectraYX-Nano is a 42M-parameter Spanish cybersecurity language model trained from scratch with curriculum learning and native Model Context Protocol (MCP) tool use. It is, to our knowledge, the first published Spanish-native cybersecurity LLM with end-to-end MCP integration.



Released Model: VectraYX-Nano v7 (Headline)

VectraYX-Nano v7 is the released headline model. It uses the same 42M architecture and three-phase curriculum pre-training as the v2 bootstrap-ablation reference; the only difference is that the supervised fine-tuning (SFT) corpus is rebalanced to a tool-use ratio of 1:21 (vs. 1:211 in v2). This single change raises B4 (tool selection) from 0.000 to 0.230±0.052 across N=4 seeds while retaining strong CVE recall (B1=0.332±0.005) and conversational quality (B5=0.725±0.130).

Files in this repo:

| File | Description |
| --- | --- |
| nano_sft_v7_s42.pt | Nano v7, seed 42 (recommended for inference) |
| nano_sft_v5.pt | Nano v2 (mixed SFT, bootstrap-ablation reference) |
| vectrayx-nano-f16.gguf | F16 GGUF (run with llama.cpp / Ollama) |
| lora/nano_lora_mini_s{42,7,13,23}.pt | LoRA adapters (tool-use density study, ratio 1:21) |
| tokenizer/vectrayx_bpe.model | BPE-16384 tokenizer |
| configs/nano.json | Nano 42M architecture config |
| configs/base.json | Base 260M architecture config |
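The brace pattern in the LoRA row is shorthand for four files, one per seed. A minimal sketch that expands it into repo-relative paths, assuming the files are named exactly as the pattern suggests (the helper name `lora_adapter_paths` is illustrative and not part of the release):

```python
SEEDS = (42, 7, 13, 23)  # seed suffixes as listed in the file table


def lora_adapter_paths(seeds=SEEDS):
    """Expand lora/nano_lora_mini_s{...}.pt into one path per seed."""
    return [f"lora/nano_lora_mini_s{seed}.pt" for seed in seeds]


for path in lora_adapter_paths():
    print(path)
```

Each resulting path can be passed as the filename argument to `hf_hub_download`, in the same way as the checkpoint in the PyTorch section.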

Key Results (VectraYX-Bench, N=4 seeds)

| Model | Params | B1 KW | B2 F1† | B3 TM | B4 Tool | B5 Chat |
| --- | --- | --- | --- | --- | --- | --- |
| VectraYX-Nano v7 (headline) | 42M | 0.332±0.005 | | | 0.230±0.052 | 0.725±0.130 |
| VectraYX-Nano v2 (bootstrap ablation) | 42M | 0.226±0.065 | 0.199±0.004 | 0.029±0.035 | 0.000 | 0.775±0.043 |
| Nano LoRA mini (ratio 1:21, N=4) | 42M | 0.011±0.004 | 0.201±0.002 | 0.021±0.012 | 0.145±0.046 | 0.575±0.043 |
| SmolLM2-135M + LoRA-32 | 135M | 0.334 | 0.225 | 0.143 | 0.160 | 0.800 |
| VectraYX-Base 260M | 260M | 0.325 | 0.220 | 0.114 | 0.000 | 0.800 |
| Base 260M LoRA mini (ratio 1:21, N=4) | 260M | 0.019±0.003 | 0.203±0.002 | | 0.445±0.201 | 0.600 |
| VectraYX-Pro 3B | 3.2B | 0.341 | 0.695 | 0.686 | 0.600 | 0.800 |
| VectraYX-Pro 7B | 7B | 0.335 | 0.815 | 0.686 | 0.880 | 0.800 |
| GPT-4o (frontier reference) | | 0.333 | 0.110 | 0.520 | 0.615 | 0.631 |

†B2 is a benchmark artifact in this revision (key mismatch in harness, fix queued).
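The ± values in the table are aggregates over N=4 seeds. A minimal sketch of that aggregation, assuming the spread is a sample standard deviation (the model card does not say which estimator is used) and using made-up per-seed scores rather than the real ones:

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(scores), stdev(scores)


# Placeholder per-seed scores, NOT the actual per-seed results.
example_scores = [0.20, 0.25, 0.30, 0.25]

m, s = summarize(example_scores)
print(f"{m:.3f}±{s:.3f}")
```

With only four seeds, the choice between sample and population standard deviation changes the reported spread noticeably, so the two should not be mixed when comparing rows.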

B5 inversion: on the 314-prompt held-out chat suite, Nano v7 (0.725±0.130) and Nano v2 (0.775±0.043) both score above GPT-4o (0.631) on mean B5, although the v7 margin (0.094) is smaller than its standard deviation across seeds. The register-matched bootstrap corpus makes conversational Spanish the model's first language.


Key Findings

1. Loss-vs-register inversion. A higher-perplexity bootstrap corpus (OpenSubtitles-ES) yields better post-SFT chat behavior than a lower-perplexity alternative (mC4-ES). At the nano scale, the bootstrap corpus dictates the model's default response style; SFT cannot fully overwrite it.

2. Tool-use is corpus-density-gated, not capacity-gated. The B4=0.000 floor in the mixed SFT (ratio 1:211) is a corpus-density artifact. Rebalancing to 1:21 (2,801 tool-use examples) shifts the first-token prior to <|tool_call|> and raises B4 to 0.230±0.052 at 42M — without retraining the backbone.
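The rebalancing in finding 2 can be illustrated with a small sketch. It assumes the ratio counts examples (not tokens) and that rebalancing is done by subsampling non-tool examples; the model card states neither, and the function name, field names, and toy data are all hypothetical:

```python
import random


def rebalance(tool_examples, other_examples, others_per_tool, seed=0):
    """Subsample non-tool examples so the mix is about 1 tool : others_per_tool."""
    rng = random.Random(seed)
    keep = min(len(other_examples), len(tool_examples) * others_per_tool)
    mixed = list(tool_examples) + rng.sample(list(other_examples), keep)
    rng.shuffle(mixed)
    return mixed


# Toy data: 10 tool-use examples and 2,110 others, i.e. a 1:211 starting mix.
tool = [{"kind": "tool", "id": i} for i in range(10)]
other = [{"kind": "other", "id": i} for i in range(2110)]

mixed = rebalance(tool, other, others_per_tool=21)
n_tool = sum(1 for ex in mixed if ex["kind"] == "tool")
print(n_tool, len(mixed) - n_tool)
```

Upsampling or adding tool-use examples would reach the same ratio without discarding other data; which route the released corpus took is not specified here.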


Inference: llama.cpp / Ollama (GGUF)

# With Ollama
ollama run hf.co/jsantillana/vectrayx-nano:vectrayx-nano-f16.gguf

# With llama.cpp
./llama-cli -m vectrayx-nano-f16.gguf \
  --chat-template llama3 \
  -p "<|system|>Eres VectraYX, asistente experto en ciberseguridad para LATAM.<|end|>" \
  -i

Runs at 6–10 tok/s on Raspberry Pi 4 and 60–100 tok/s on a laptop CPU.
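To put the throughput figures in terms of waiting time, here is a rough estimate of how long a reply takes to generate. The 200-token reply length is an illustrative assumption, and prompt processing time is ignored:

```python
def generation_seconds(n_tokens, tok_per_s_low, tok_per_s_high):
    """Return (best_case, worst_case) seconds to generate n_tokens."""
    return n_tokens / tok_per_s_high, n_tokens / tok_per_s_low


N_TOKENS = 200  # illustrative reply length, not a figure from the model card

for device, (low, high) in {
    "Raspberry Pi 4": (6, 10),
    "laptop CPU": (60, 100),
}.items():
    best, worst = generation_seconds(N_TOKENS, low, high)
    print(f"{device}: {best:.1f}-{worst:.1f} s")
```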


Inference: PyTorch

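import json
import sys

import torch
from huggingface_hub import hf_hub_download

sys.path.insert(0, ".")  # needs training/transformer.py from vectrayx-paper-code

# Download the checkpoint, tokenizer, and architecture config from the Hub.
ckpt = hf_hub_download("jsantillana/vectrayx-nano", "nano_sft_v7_s42.pt")
tok  = hf_hub_download("jsantillana/vectrayx-nano", "tokenizer/vectrayx_bpe.model")
cfg  = hf_hub_download("jsantillana/vectrayx-nano", "configs/nano.json")

with open(cfg) as f:
    config = json.load(f)  # architecture hyperparameters
state = torch.load(ckpt, map_location="cpu")  # checkpoint contents; layout not documented here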

The full inference script is available in vectrayx-paper-code.


Training Details

| Component | Details |
| --- | --- |
| Parameters | 41.95M |
| Architecture | Transformer decoder, GQA (8q/2kv), QK-Norm, RMSNorm, SwiGLU, RoPE, z-loss |
| Tokenizer | BPE-16384, byte-fallback, 50/50 conv/tech balance |
| Pre-training | 170M tokens, 3-phase curriculum with 25% replay buffer |
| SFT (v7) | 13K OASST1-ES + 4K CVE Q&A + 2.8K tool-use (ratio 1:21) |
| Hardware | GCP L4 24GB (pre-training) + AWS g4dn.xlarge T4 16GB (multi-seed SFT) |
| Cost | ~$29 USD total (corpus + training) |
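One way to read "3-phase curriculum with 25% replay buffer" is that, during each phase, about a quarter of every batch is drawn from earlier phases. The sketch below illustrates that reading only; the actual replay mechanism is not described here, and `sample_batch`, the phase contents, and the batch size are hypothetical:

```python
import random


def sample_batch(current_phase, earlier_phases, batch_size, replay_frac=0.25, rng=None):
    """Draw a batch with about replay_frac of its items replayed from earlier phases."""
    rng = rng or random.Random(0)
    # No replay is possible in the first phase, when there is nothing earlier.
    n_replay = int(batch_size * replay_frac) if earlier_phases else 0
    replay_pool = [doc for phase in earlier_phases for doc in phase]
    batch = [rng.choice(replay_pool) for _ in range(n_replay)]
    batch += [rng.choice(current_phase) for _ in range(batch_size - n_replay)]
    rng.shuffle(batch)
    return batch


# Toy documents tagged by phase; the contents are placeholders.
phase1 = [("p1", i) for i in range(100)]
phase2 = [("p2", i) for i in range(100)]
phase3 = [("p3", i) for i in range(100)]

batch = sample_batch(phase3, [phase1, phase2], batch_size=32)
n_replayed = sum(1 for tag, _ in batch if tag != "p3")
print(n_replayed, "of", len(batch), "replayed")
```

Replaying earlier-phase data in this way is a common guard against forgetting material from earlier curriculum stages.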

Citation

@misc{santillana2026vectrayx,
  title     = {VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model
               with Curriculum Learning and Native Tool Use},
  author    = {Santillana, Juan S.},
  year      = {2026},
  eprint    = {2605.13989},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url       = {https://arxiv.org/abs/2605.13989}
}