---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
datasets:
- PleIAs/SYNTH
- HuggingFaceFW/fineweb-edu
tags:
- seqcond
- hybrid
- reasoning
- spectral
- trickstr
library_name: transformers
---

# Nautile-370M

*Nautile-370M cover*

**Nautile-370M** is a 371M-parameter hybrid language model for reasoning and language understanding. Its backbone alternates two *SeqCond Attention* (SCA) layers — a spectral sequence operator grounded in the derivative of the empirical characteristic function — with one standard transformer layer, giving a 2:1 SCA/Transformer ratio across 24 layers. The model was pretrained on ~0.8T tokens on a single TPU v4-64 pod slice (Google TRC program), then post-trained with reinforcement learning on a single NVIDIA DGX Spark.

A technical report is available on arXiv: [2604.24809](https://arxiv.org/abs/2604.24809).

---

## Architecture

The backbone repeats the pattern **SCA → SCA → Transformer** eight times (24 layers total).
| | |
|---|---|
| Parameters | 371M |
| Layers | 24 (16 SCA + 8 Transformer) |
| Model dimension | 1024 |
| FF dimension | 2730 |
| Context length | 4096 |
| Tokenizer | cl100k_base (tiktoken) |
| Weight tying | Yes |
| Dtype | bfloat16 |
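For intuition about the "derivative of the empirical characteristic function" mentioned above, here is a toy sketch. It is **not** the model's actual SCA implementation (see the technical report for that); it only illustrates how such a derivative can be tracked as a fixed-size complex state with O(1) work per token, which is the property the SCA layers described next rely on.

```python
import torch

# Toy illustration only; NOT the SCA layer itself. It tracks the derivative
# of the empirical characteristic function of a scalar token stream,
#   d/dω φ̂_T(ω) = (1/T) · Σ_t  i · x_t · exp(i · ω · x_t),
# at a small bank of frequencies ω, using a fixed-size complex state that
# costs O(1) work per incoming token.
omegas = torch.linspace(0.1, 2.0, 8)                      # frequency bank
state = torch.zeros(len(omegas), dtype=torch.complex64)   # fixed-size state

xs = torch.randn(16)                                       # stand-in token stream
for t, x in enumerate(xs, start=1):
    state = state + 1j * x * torch.exp(1j * omegas * x)    # O(1) update per token
    deriv_ecf = state / t                                   # running estimate

print(deriv_ecf.shape)  # torch.Size([8]): the state does not grow with sequence length
```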
**SCA layers** maintain a fixed-size complex state updated in O(1) per token at inference (parallel prefix scan during training). They are theoretically expressive enough to reproduce any softmax attention output as a special case.

**Transformer layers** (every 3rd layer) use standard causal self-attention with RoPE and GQA (16 heads, 4 KV heads).

---

## Intended use

Nautile-370M is designed for **language understanding, common-sense reasoning, and classification tasks**:

- Sentiment analysis, intent detection, topic labeling
- Structured information extraction
- Fine-tuning for domain-specific classification
- Large-scale opinion modeling (thousands of instances in parallel on modest hardware)

It is **not designed** for open-ended multi-turn chat, code generation, or knowledge-intensive QA — tasks that benefit more from scale than from architectural efficiency at this parameter count.

---

## Benchmarks

0-shot evaluation against models of similar size:

*Benchmark comparison*

| Benchmark | Nautile-370M | Qwen2.5-0.5B | Granite-350M | LFM2.5-350M | SmolLM2-360M |
|---|---|---|---|---|---|
| Training tokens | ~0.8T | 18T | 10–12T | 28T | 4T |
| OpenBookQA | 49.3 | 34.4 | 31.6 | 26.4 | 24.2 |
| ARC | 57.0 | 50.1 | 30.0 | 32.8 | 43.7 |
| CommonsenseQA | 46.8 | 46.5 | 36.2 | 44.3 | 18.4 |
| GSM8K | 33.4 | 28.3 | 31.5 | 33.0 | 7.4 |
| PIQA | 61.5 | 61.3 | 50.8 | 49.5 | 48.2 |
| IFEval | 36.9 | 31.6 | 55.4 | 62.4 | 41.0 |
| TriviaQA | 23.8 | 27.8 | 25.2 | 22.9 | 28.0 |
| MATH500 | 2.4 | 18.8 | 5.6 | 12.2 | 0.0 |
| MMLU-Pro | 14.9 | 14.3 | 11.2 | 18.6 | 10.3 |
| MMLU | 39.2 | 33.7 | 35.0 | 39.1 | 35.8 |
| GPQA Diamond | 27.3 | 10.1 | 26.3 | 24.8 | 23.2 |
| Average | 35.7 | 32.4 | 30.8 | 33.3 | 25.5 |

*Average benchmark score versus training tokens*

All scores are accuracy (%), 0-shot. Evaluation is strict: responses with multiple candidate answers are scored as incorrect. Nautile-370M reaches these numbers with ~0.8T training tokens, compared to 4–28T for the other models in this table.

---

## Generation quality — LLM-as-a-judge

We evaluate generation quality using an LLM-as-a-judge setup (GPT-4.1). For a diverse set of prompts covering factual knowledge, commonsense reasoning, instruction following, explanatory writing, creative writing, and short analytical responses, the judge compares Nautile-370M's output against a reference model's output and selects the better answer based on overall quality. The evaluation emphasizes correctness, faithfulness to the prompt, clarity, coherence, and non-hallucination.

*LLM-as-a-judge win rate comparison*

| Comparison | Nautile-370M wins | Reference wins | Tie |
|---|---|---|---|
| vs LFM2.5-350M | 57% | 42% | 1% |
| vs Granite-350M | 63% | 35% | 2% |
| vs Qwen2.5-0.5B | 74% | 22% | 4% |
| vs SmolLM2-360M | 63% | 36% | 1% |
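For reference, below is a minimal sketch of this kind of pairwise judging call, assuming the OpenAI Python client. The judge prompt and harness shown here are illustrative assumptions, not the exact setup used to produce the table above.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """Compare two answers to the same prompt. Judge correctness,
faithfulness to the prompt, clarity, coherence, and absence of hallucination.
Reply with exactly "A", "B", or "TIE".

Prompt:
{prompt}

Answer A:
{a}

Answer B:
{b}"""

def judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'TIE' according to the judge model."""
    resp = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": JUDGE_TEMPLATE.format(prompt=prompt, a=answer_a, b=answer_b),
        }],
    )
    return resp.choices[0].message.content.strip()
```

A harness built on this sketch would typically also swap the A/B order between runs to control for position bias.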
---

## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "trickstr-ai/nautile-370m",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).cuda().eval()

tokenizer = AutoTokenizer.from_pretrained(
    "trickstr-ai/nautile-370m",
    trust_remote_code=True,
)

# encode_chat() wraps your prompt in the think-then-answer template.
input_ids = torch.tensor([tokenizer.encode_chat("What is rapamycin?")]).cuda()

# acceleration="auto" (default): uses CUDA graphs on GPU, adds Triton
# kernels automatically if the triton package is installed.
# CUDA graphs are captured on the first generate() call (~2 s overhead once).
output = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.15,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    # acceleration="auto",        # default — cuda_graph + triton if available
    # acceleration="cuda_graph",  # cuda graph only
    # acceleration="none",        # plain PyTorch, no graph capture
)

print(tokenizer.decode(output[0, input_ids.shape[1]:].tolist()))
```

---

## Chat template

The model uses a **ChatML** format with a chain-of-thought section delimited by `<|think_start|>` / `<|think_end|>`:

```
<|im_start|>user
{prompt}
<|im_end|><|im_start|>assistant
<|think_start|>{chain of thought}<|think_end|>
{answer}
<|im_end|>
```

| Token | ID |
|---|---|
| `<\|im_start\|>` | 100278 |
| `<\|im_end\|>` | 100279 |
| `<\|think_start\|>` | 100280 |
| `<\|think_end\|>` | 100281 |

You can also use `apply_chat_template` for multi-turn conversations:

```python
messages = [{"role": "user", "content": "What is rapamycin?"}]
ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
```

Recommended generation parameters: `temperature=0.15`, `top_p=0.9`, `top_k=50`, `repetition_penalty=1.1`.

---

## Inference speed

All measurements are **out-of-the-box via the Hugging Face `transformers` library** only — not vLLM, TensorRT-LLM, or any other specialized serving stack. Batch size 1, bfloat16, single GPU.

*Out-of-the-box Hugging Face Transformers inference speed comparison*

| Model | tok/s |
|---|---|
| Nautile-370M (Triton kernel) | 125.9 |
| Nautile-370M | 108.3 |
| LFM2.5-350M | 72.9 |
| Qwen2.5-0.5B | 44.5 |
| SmolLM2-360M | 33.8 |
| Baguettotron | 14.4 |
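A rough way to reproduce this kind of number with plain `transformers` is sketched below, continuing from the Quick start snippet. The prompt, GPU, and generation length here are assumptions; the warm-up call keeps the one-time CUDA graph capture described below out of the timing.

```python
import time
import torch

# Assumes `model`, `tokenizer`, and `input_ids` from the Quick start snippet.
max_new = 256

# Warm-up call: triggers the one-time CUDA graph capture so it is not
# counted in the throughput measurement.
model.generate(input_ids, max_new_tokens=max_new)

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(input_ids, max_new_tokens=max_new)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tok/s")
```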
`acceleration="auto"` (default) enables CUDA graphs and Triton kernels automatically when available. CUDA graphs are captured once on the first `generate()` call (~2 s overhead) and reused for all subsequent calls.

---

## Training

**Stage 1 — Pretraining** (~350B tokens, TPU v4-64): FineWeb-Edu for broad factual and linguistic coverage.

**Stage 2 — Supervised fine-tuning** (~250B tokens, TPU v4-64): SYNTH corpus of chain-of-thought reasoning traces, plus ~4M synthetic documents distilled from GPT-OSS-20B, GPT-OSS-120B, Mistral Small 3.2, and Mistral Large 3 with retrieval-guided format alignment.

**Stage 3 — Reinforcement learning** (DGX Spark): a three-stage RL pipeline:

1. *Dr. GRPO* with LLM-judge rewards for format alignment
2. *Gradient-balanced GRPO* — decouples positive/negative gradient components to prevent low-success-rate instability; +2.4 pp GSM8K
3. *Scored self-distillation* — fine-tuning on the model's own verified correct reasoning traces with advantage-weighted loss; +2.1 pp GSM8K

| Stage | GSM8K |
|---|---|
| After SFT | 27.98% |
| + Dr. GRPO | 28.96% |
| + Gradient-balanced GRPO | 31.36% |
| + Scored self-distillation | **33.43%** |

---

## `trust_remote_code=True`

This model ships with custom modeling and tokenizer code. Pass `trust_remote_code=True` to `from_pretrained` calls. The relevant files (`modeling_seqcond.py`, `tokenization_seqcond.py`, `configuration_seqcond.py`) are included in this repository.

---

## Citation

```bibtex
@article{chenebaux2025nautile,
  title  = {Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model},
  author = {Chenebaux, Maixent},
  year   = {2025},
  note   = {arXiv preprint, link coming soon}
}
```