---
datasets:
- vectrayx/vectrayx-bench
language:
- es
license: apache-2.0
metrics:
- accuracy
- f1
pipeline_tag: text-generation
tags:
- cybersecurity
- spanish
- tool-use
- mcp
- curriculum-learning
- from-scratch
- arxiv:2605.13989
---

# VectraYX-Nano

VectraYX-Nano is a 42M-parameter Spanish cybersecurity language model trained **from scratch** with curriculum learning and native [Model Context Protocol (MCP)](https://modelcontextprotocol.io) tool use. It is, to our knowledge, the first published Spanish-native cybersecurity LLM with end-to-end MCP integration.

[![arXiv](https://img.shields.io/badge/arXiv-2605.13989-b31b1b.svg)](https://arxiv.org/abs/2605.13989)
[![Zenodo](https://zenodo.org/badge/DOI/10.5281/zenodo.20122226.svg)](https://doi.org/10.5281/zenodo.20122226)

- **Paper:** [VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use](https://arxiv.org/abs/2605.13989)
- **Repository:** [vectrayx/vectrayx-nano-paper](https://github.com/vectrayx/vectrayx-nano-paper)
- **arXiv DOI:** https://doi.org/10.48550/arXiv.2605.13989
- **Author website:** https://jsantillana.com

---

## Released Model: VectraYX-Nano v7 (Headline)

**VectraYX-Nano v7** is the released headline model. It uses the same 42M architecture and three-phase curriculum pre-training as the v2 bootstrap-ablation reference, with the SFT corpus rebalanced to a tool-use ratio of 1:21 (vs. 1:211 in v2). This single change raises B4 (tool-selection) from 0.000 to **0.230 ± 0.052** across N=4 seeds while retaining strong CVE recall (B1=0.332±0.005) and conversational quality (B5=0.725±0.130).

Files in this repo:

| File | Description |
|---|---|
| `nano_sft_v7_s42.pt` | **Nano v7 seed 42 (recommended for inference)** |
| `nano_sft_v5.pt` | Nano v2 (mixed SFT, bootstrap-ablation reference) |
| `vectrayx-nano-f16.gguf` | **F16 GGUF (run with llama.cpp / Ollama)** |
| `lora/nano_lora_mini_s{42,7,13,23}.pt` | LoRA adapters (tool-use density study, ratio 1:21) |
| `tokenizer/vectrayx_bpe.model` | BPE-16384 tokenizer |
| `configs/nano.json` | Nano 42M architecture config |
| `configs/base.json` | Base 260M architecture config |

---

## Key Results (VectraYX-Bench, N=4 seeds)

| Model | Params | B1 KW | B2 F1† | B3 TM | B4 Tool | B5 Chat |
|---|---|---|---|---|---|---|
| **VectraYX-Nano v7** *(headline)* | 42M | **0.332±0.005** | n/a | n/a | **0.230±0.052** | 0.725±0.130 |
| VectraYX-Nano v2 *(bootstrap ablation)* | 42M | 0.226±0.065 | 0.199±0.004 | 0.029±0.035 | 0.000 | **0.775±0.043** |
| Nano LoRA mini (ratio 1:21, N=4) | 42M | 0.011±0.004 | 0.201±0.002 | 0.021±0.012 | 0.145±0.046 | 0.575±0.043 |
| SmolLM2-135M + LoRA-32 | 135M | 0.334 | 0.225 | 0.143 | 0.160 | 0.800 |
| VectraYX-Base 260M | 260M | 0.325 | 0.220 | 0.114 | 0.000 | 0.800 |
| Base 260M LoRA mini (ratio 1:21, N=4) | 260M | 0.019±0.003 | 0.203±0.002 | n/a | 0.445±0.201 | 0.600 |
| VectraYX-Pro 3B | 3.2B | 0.341 | 0.695 | 0.686 | 0.600 | 0.800 |
| VectraYX-Pro 7B | 7B | 0.335 | 0.815 | 0.686 | 0.880 | 0.800 |
| GPT-4o *(frontier reference)* | n/a | 0.333 | 0.110 | 0.520 | 0.615 | 0.631 |

†B2 scores are a benchmark artifact in this revision (key mismatch in the harness; a fix is queued).

**B5 inversion:** Nano v7 (0.725±0.130) and Nano v2 (0.775±0.043) both **exceed GPT-4o (0.631)** in mean score on the 314-prompt held-out chat suite: the register-matched bootstrap corpus makes conversational Spanish the model's first language.

---

## Key Findings

**1. Loss-vs-register inversion.** A higher-perplexity bootstrap corpus (OpenSubtitles-ES) yields *better* post-SFT chat behavior than a lower-perplexity alternative (mC4-ES).
At the nano scale, the bootstrap corpus dictates the model's default response style; SFT cannot fully overwrite it.

**2. Tool-use is corpus-density-gated, not capacity-gated.** The B4=0.000 floor in the mixed SFT (ratio 1:211) is a corpus-density artifact. Rebalancing to 1:21 (2,801 tool-use examples) shifts the first-token prior to `<|tool_call|>` and raises B4 to 0.230±0.052 at 42M, without retraining the backbone.

---

## Inference: llama.cpp / Ollama (GGUF)

```bash
# With Ollama
ollama run hf.co/jsantillana/vectrayx-nano:vectrayx-nano-f16.gguf

# With llama.cpp
./llama-cli -m vectrayx-nano-f16.gguf \
  --chat-template llama3 \
  -p "<|system|>Eres VectraYX, asistente experto en ciberseguridad para LATAM.<|end|>" \
  -i
```

Runs at 6–10 tok/s on Raspberry Pi 4 and 60–100 tok/s on a laptop CPU.

---

## Inference: PyTorch

```python
import json
import sys

import torch
from huggingface_hub import hf_hub_download

sys.path.insert(0, ".")  # needs training/transformer.py from vectrayx-paper-code

# Each call downloads the file (or reuses the local cache) and returns its local path.
ckpt = hf_hub_download("jsantillana/vectrayx-nano", "nano_sft_v7_s42.pt")
tok = hf_hub_download("jsantillana/vectrayx-nano", "tokenizer/vectrayx_bpe.model")
cfg = hf_hub_download("jsantillana/vectrayx-nano", "configs/nano.json")
```

The full inference script is available at [vectrayx-paper-code](https://huggingface.co/jsantillana/vectrayx-paper-code).

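After generation, a caller usually has to decide whether the output is a tool call or a plain chat reply. Key Finding 2 describes the tool-use prior in terms of the first token `<|tool_call|>`, so a minimal post-processing sketch can check for that marker. The payload format is an assumption here (a JSON object, optionally terminated by `<|end|>`); the function name `parse_tool_call` and the tool name `nvd_lookup` are hypothetical and not taken from the repository.

```python
import json
from typing import Optional

TOOL_CALL_TOKEN = "<|tool_call|>"  # marker named in Key Findings
END_TOKEN = "<|end|>"              # turn terminator seen in the llama.cpp prompt


def parse_tool_call(generated: str) -> Optional[dict]:
    """Return the decoded tool call if the output opens with the marker, else None.

    Assumption: the text after the marker is a JSON object such as
    {"name": ..., "arguments": {...}}. The real schema may differ.
    """
    text = generated.lstrip()
    if not text.startswith(TOOL_CALL_TOKEN):
        return None  # plain chat answer, no tool requested
    # Keep only what sits between the marker and the first end token (if any).
    payload = text[len(TOOL_CALL_TOKEN):].split(END_TOKEN, 1)[0].strip()
    try:
        call = json.loads(payload)
    except json.JSONDecodeError:
        return None  # marker present, but the payload is not valid JSON
    return call if isinstance(call, dict) else None


if __name__ == "__main__":
    # Hypothetical model outputs, for illustration only.
    sample = '<|tool_call|>{"name": "nvd_lookup", "arguments": {"cve": "CVE-2021-44228"}}<|end|>'
    print(parse_tool_call(sample))
    print(parse_tool_call("Hola, ¿en qué puedo ayudarte?"))
```

Returning `None` for both "no marker" and "malformed payload" keeps the caller's fallback path simple: anything that is not a well-formed call is treated as ordinary chat text.
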

---

## Training Details

| Component | Details |
|---|---|
| Parameters | 41.95M |
| Architecture | Transformer decoder, GQA (8q/2kv), QK-Norm, RMSNorm, SwiGLU, RoPE, z-loss |
| Tokenizer | BPE-16384, byte-fallback, 50/50 conv/tech balance |
| Pre-training | 170M tokens, 3-phase curriculum with 25% replay buffer |
| SFT (v7) | 13K OASST1-ES + 4K CVE Q&A + 2.8K tool-use (ratio 1:21) |
| Hardware | GCP L4 24GB (pre-training) + AWS g4dn.xlarge T4 16GB (multi-seed SFT) |
| Cost | ~$29 USD total (corpus + training) |

---

## Citation

```bibtex
@misc{santillana2026vectrayx,
  title         = {VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use},
  author        = {Santillana, Juan S.},
  year          = {2026},
  eprint        = {2605.13989},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.13989}
}
```