---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
datasets:
- PleIAs/SYNTH
- HuggingFaceFW/fineweb-edu
tags:
- seqcond
- hybrid
- reasoning
- spectral
- trickstr
library_name: transformers
---
# Nautile-370M
**Nautile-370M** is a 371M-parameter hybrid language model for reasoning and language understanding.
Its backbone alternates two *SeqCond Attention* (SCA) layers — a spectral sequence operator grounded in the derivative of the empirical characteristic function — with one standard transformer layer, giving a 2:1 SCA/Transformer ratio across 24 layers.
The model was pretrained on ~0.8T tokens on a single TPU v4-64 pod slice (Google TRC program), then post-trained with reinforcement learning on a single NVIDIA DGX Spark.
A technical report is available on arXiv: [2604.24809](https://arxiv.org/abs/2604.24809).
---
## Architecture
The backbone repeats the pattern **SCA → SCA → Transformer** eight times (24 layers total).
| Hyperparameter | Value |
|---|---|
| Parameters | 371M |
| Layers | 24 (16 SCA + 8 Transformer) |
| Model dimension | 1024 |
| FF dimension | 2730 |
| Context length | 4096 |
| Tokenizer | cl100k_base (tiktoken) |
| Weight tying | Yes |
| Dtype | bfloat16 |
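The layer arithmetic above can be sanity-checked directly from the repeating pattern:

```python
# SCA, SCA, Transformer repeated 8 times -> 24 layers total
pattern = ["SCA", "SCA", "Transformer"] * 8

assert len(pattern) == 24
assert pattern.count("SCA") == 16
assert pattern.count("Transformer") == 8
print(pattern[:6])  # ['SCA', 'SCA', 'Transformer', 'SCA', 'SCA', 'Transformer']
```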
**SCA layers** maintain a fixed-size complex state updated in O(1) per token at inference (parallel prefix scan during training). They are theoretically expressive enough to reproduce any softmax attention output as a special case.
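As a toy numerical illustration of the idea (not the model's actual kernel), a fixed-size complex state tracking the derivative of the empirical characteristic function can be updated in O(1) per token; the function name, scalar token stream, and frequency bank below are all invented for the example. The final loop also checks that a cumulative scan over per-token terms, the training-time analogue of the parallel prefix scan, reaches the same state.

```python
import numpy as np

def sca_state_update(state, x_t, omegas):
    """O(1)-per-token update of a fixed-size complex state.

    Accumulates the derivative (w.r.t. omega) of the unnormalized
    empirical characteristic function sum_t exp(i*omega*x_t), i.e.
    sum_t i * x_t * exp(i * omega * x_t), over a bank of frequencies.
    """
    return state + 1j * x_t * np.exp(1j * omegas * x_t)

omegas = np.linspace(0.1, 2.0, 8)              # hypothetical frequency bank
state = np.zeros_like(omegas, dtype=complex)   # state size fixed, not O(seq)
stream = [0.5, -1.2, 0.3]                      # toy scalar token stream
for x_t in stream:
    state = sca_state_update(state, x_t, omegas)

# Training-time analogue: compute the state after every prefix at once
# with a cumulative scan over the per-token terms.
per_token = np.stack([1j * x * np.exp(1j * omegas * x) for x in stream])
prefix_states = np.cumsum(per_token, axis=0)
assert np.allclose(prefix_states[-1], state)
```

Note the memory cost depends only on the number of frequencies, never on sequence length, which is what makes constant-time streaming inference possible.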
**Transformer layers** (every 3rd layer) use standard causal self-attention with RoPE and GQA (16 heads, 4 KV heads).
---
## Intended use
Nautile-370M is designed for **language understanding, common-sense reasoning, and classification tasks**:
- Sentiment analysis, intent detection, topic labeling
- Structured information extraction
- Fine-tuning for domain-specific classification
- Large-scale opinion modeling (thousands of instances in parallel on modest hardware)
It is **not designed** for open-ended multi-turn chat, code generation, or knowledge-intensive QA — tasks that benefit more from scale than from architectural efficiency at this parameter count.
---
## Benchmarks
0-shot evaluation against models of similar size:
| Benchmark | Nautile-370M | Qwen2.5-0.5B | Granite-350M | LFM2.5-350M | SmolLM2-360M |
|---|---|---|---|---|---|
| Training tokens | ~0.8T | 18T | 10–12T | 28T | 4T |
| OpenBookQA | 49.3 | 34.4 | 31.6 | 26.4 | 24.2 |
| ARC | 57.0 | 50.1 | 30.0 | 32.8 | 43.7 |
| CommonsenseQA | 46.8 | 46.5 | 36.2 | 44.3 | 18.4 |
| GSM8K | 33.4 | 28.3 | 31.5 | 33.0 | 7.4 |
| PIQA | 61.5 | 61.3 | 50.8 | 49.5 | 48.2 |
| IFEval | 36.9 | 31.6 | 55.4 | 62.4 | 41.0 |
| TriviaQA | 23.8 | 27.8 | 25.2 | 22.9 | 28.0 |
| MATH500 | 2.4 | 18.8 | 5.6 | 12.2 | 0.0 |
| MMLU-Pro | 14.9 | 14.3 | 11.2 | 18.6 | 10.3 |
| MMLU | 39.2 | 33.7 | 35.0 | 39.1 | 35.8 |
| GPQA Diamond | 27.3 | 10.1 | 26.3 | 24.8 | 23.2 |
| Average | 35.7 | 32.4 | 30.8 | 33.3 | 25.5 |
All scores are accuracy (%), 0-shot. Evaluation is strict: responses with multiple candidate answers are scored as incorrect. Nautile-370M reaches these numbers with ~0.8T training tokens, compared to 10–28T for the other models in this table.
---
## Generation quality — LLM-as-a-judge
We evaluate generation quality using an LLM-as-a-judge setup (GPT-4.1). For a diverse set of prompts covering factual knowledge, commonsense reasoning, instruction following, explanatory writing, creative writing, and short analytical responses, the judge compares Nautile-370M's output against a reference model's output and selects the better answer based on overall quality. The evaluation emphasizes correctness, faithfulness to the prompt, clarity, coherence, and non-hallucination.
| Comparison | Nautile-370M wins | Reference wins | Tie |
|---|---|---|---|
| vs LFM2.5-350M | 57% | 42% | 1% |
| vs Granite-350M | 63% | 35% | 2% |
| vs Qwen2.5-0.5B | 74% | 22% | 4% |
| vs SmolLM2-360M | 63% | 36% | 1% |
---
## Quick start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"trickstr-ai/nautile-370m",
trust_remote_code=True,
dtype=torch.bfloat16,
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(
"trickstr-ai/nautile-370m",
trust_remote_code=True,
)
# encode_chat() wraps your prompt in the think-then-answer template.
input_ids = torch.tensor([tokenizer.encode_chat("What is rapamycin?")]).cuda()
# acceleration="auto" (default): uses CUDA graphs on GPU, adds Triton
# kernels automatically if the triton package is installed.
# CUDA graphs are captured on the first generate() call (~2 s overhead once).
output = model.generate(
input_ids,
max_new_tokens=512,
temperature=0.15,
top_p=0.9,
top_k=50,
repetition_penalty=1.1,
# acceleration="auto", # default — cuda_graph + triton if available
# acceleration="cuda_graph", # cuda graph only
# acceleration="none", # plain PyTorch, no graph capture
)
print(tokenizer.decode(output[0, input_ids.shape[1]:].tolist()))
```
---
## Chat template
The model uses a **ChatML** format with a chain-of-thought section delimited by `<|think_start|>` / `<|think_end|>`:
```
<|im_start|>user
{prompt}
<|im_end|><|im_start|>assistant
<|think_start|>{chain of thought}<|think_end|>
{answer}
<|im_end|>
```
| Token | ID |
|---|---|
| `<\|im_start\|>` | 100278 |
| `<\|im_end\|>` | 100279 |
| `<\|think_start\|>` | 100280 |
| `<\|think_end\|>` | 100281 |
You can also use `apply_chat_template` for multi-turn conversations:
```python
messages = [{"role": "user", "content": "What is rapamycin?"}]
ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
```
Recommended generation parameters: `temperature=0.15`, `top_p=0.9`, `top_k=50`, `repetition_penalty=1.1`.
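Because generations include the chain-of-thought between the think delimiters, you may often want only the final answer. A minimal helper (shown for illustration; not part of the tokenizer API):

```python
def extract_answer(decoded: str) -> str:
    """Drop the <|think_start|>...<|think_end|> section, keep the answer."""
    if "<|think_end|>" in decoded:
        decoded = decoded.split("<|think_end|>", 1)[1]
    return decoded.split("<|im_end|>", 1)[0].strip()

sample = (
    "<|think_start|>recall: mTOR inhibitor<|think_end|>\n"
    "Rapamycin is an immunosuppressant.<|im_end|>"
)
print(extract_answer(sample))  # Rapamycin is an immunosuppressant.
```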
---
## Inference speed
All measurements are **out-of-the-box via the Hugging Face `transformers` library** only — not vLLM, TensorRT-LLM, or any other specialized serving stack. Batch size 1, bfloat16, single GPU.
| Model | tok/s |
|---|---|
| Nautile-370M (Triton kernel) | 125.9 |
| Nautile-370M | 108.3 |
| LFM2.5-350M | 72.9 |
| Qwen2.5-0.5B | 44.5 |
| SmolLM2-360M | 33.8 |
| Baguettotron | 14.4 |
`acceleration="auto"` (default) enables CUDA graphs and Triton kernels automatically when available. CUDA graphs are captured once on the first `generate()` call (~2 s overhead) and reused for all subsequent calls.
---
## Training
**Stage 1 — Pretraining** (~350B tokens, TPU v4-64):
FineWeb-Edu for broad factual and linguistic coverage.
**Stage 2 — Supervised fine-tuning** (~250B tokens, TPU v4-64):
SYNTH corpus of chain-of-thought reasoning traces, plus ~4M synthetic documents distilled from GPT-OSS-20B, GPT-OSS-120B, Mistral Small 3.2, and Mistral Large 3 with retrieval-guided format alignment.
**Stage 3 — Reinforcement learning** (DGX Spark):
Three-stage RL pipeline:
1. *Dr. GRPO* with LLM-judge rewards for format alignment
2. *Gradient-balanced GRPO* — decouples positive/negative gradient components to prevent low-success-rate instability; +2.4 pp GSM8K
3. *Scored self-distillation* — fine-tuning on the model's own verified correct reasoning traces with advantage-weighted loss; +2.1 pp GSM8K
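Stage 3's scored self-distillation amounts to weighting the usual NLL on each verified trace by its advantage. The NumPy sketch below is a toy illustration with invented names and shapes, not the actual training code:

```python
import numpy as np

def advantage_weighted_nll(logprobs, target_ids, advantages):
    """Toy advantage-weighted loss for self-distillation.

    logprobs:   (batch, seq, vocab) log-probs on the model's own traces
    target_ids: (batch, seq) tokens of verified-correct traces
    advantages: (batch,) per-trace score (e.g. reward minus a baseline)
    """
    b, s, _ = logprobs.shape
    # gather the log-prob of each target token
    tok_nll = -logprobs[np.arange(b)[:, None], np.arange(s)[None, :], target_ids]
    trace_nll = tok_nll.mean(axis=1)           # average NLL per trace
    return float((advantages * trace_nll).mean())

# with uniform log-probs over a vocab of 4 and unit advantages,
# the loss reduces to log(4)
logprobs = np.full((2, 3, 4), np.log(0.25))
targets = np.zeros((2, 3), dtype=int)
loss = advantage_weighted_nll(logprobs, targets, np.array([1.0, 1.0]))
print(round(loss, 4))  # 1.3863
```

Traces with higher advantage pull the gradient harder, so the model reinforces its best verified reasoning more strongly than marginal successes.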
| Stage | GSM8K |
|---|---|
| After SFT | 27.98% |
| + Dr. GRPO | 28.96% |
| + Gradient-balanced GRPO | 31.36% |
| + Scored self-distillation | **33.43%** |
---
## `trust_remote_code=True`
This model ships with custom modeling and tokenizer code. Pass `trust_remote_code=True` to `from_pretrained` calls. The relevant files (`modeling_seqcond.py`, `tokenization_seqcond.py`, `configuration_seqcond.py`) are included in this repository.
---
## Citation
```bibtex
@article{chenebaux2025nautile,
title = {Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model},
author = {Chenebaux, Maixent},
year = {2025},
  note   = {arXiv preprint arXiv:2604.24809}
}
```