---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
datasets:
- PleIAs/SYNTH
- HuggingFaceFW/fineweb-edu
tags:
- seqcond
- hybrid
- reasoning
- spectral
- trickstr
library_name: transformers
---
# Nautile-370M

<p align="center">
<img src="assets/cover.png" alt="Nautile-370M cover" width="100%" />
</p>

**Nautile-370M** is a 371M-parameter hybrid language model for reasoning and language understanding.

Its backbone alternates two *SeqCond Attention* (SCA) layers with one standard transformer layer, a 2:1 SCA/Transformer ratio across its 24 layers. SCA is a spectral sequence operator grounded in the derivative of the empirical characteristic function.

The model was pretrained on ~0.8T tokens on a single TPU v4-64 pod slice (Google TRC program), then post-trained with reinforcement learning on a single NVIDIA DGX Spark.

A technical report is available on arXiv: [2604.24809](https://arxiv.org/abs/2604.24809).

---

## Architecture

The backbone repeats the pattern **SCA → SCA → Transformer** eight times (24 layers total).

<table>
<tbody>
<tr><td><strong>Parameters</strong></td><td>371M</td></tr>
<tr><td><strong>Layers</strong></td><td>24 (16 SCA + 8 Transformer)</td></tr>
<tr><td><strong>Model dimension</strong></td><td>1024</td></tr>
<tr><td><strong>FF dimension</strong></td><td>2730</td></tr>
<tr><td><strong>Context length</strong></td><td>4096</td></tr>
<tr><td><strong>Tokenizer</strong></td><td><code>cl100k_base</code> (tiktoken)</td></tr>
<tr><td><strong>Weight tying</strong></td><td>Yes</td></tr>
<tr><td><strong>Dtype</strong></td><td>bfloat16</td></tr>
</tbody>
</table>
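
The 2:1 interleave can be pictured with a small sketch (the layer names here are illustrative placeholders, not the module names used in `modeling_seqcond.py`):

```python
# Illustrative sketch of the 24-layer stack: the SCA -> SCA -> Transformer
# pattern repeated eight times. Names are placeholders, not the real modules.
def build_backbone(n_blocks: int = 8) -> list[str]:
    layers = []
    for _ in range(n_blocks):
        layers += ["SCA", "SCA", "Transformer"]
    return layers

stack = build_backbone()
assert len(stack) == 24
assert stack.count("SCA") == 16 and stack.count("Transformer") == 8
```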

**SCA layers** maintain a fixed-size complex state updated in O(1) per token at inference (parallel prefix scan during training). They are theoretically expressive enough to reproduce any softmax attention output as a special case.
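
As a rough intuition for the O(1) recurrent view (the actual SCA update is defined in the technical report, so treat this as an illustration only): the derivative of the empirical characteristic function can be estimated with a fixed-size complex accumulator that costs constant work per token.

```python
import torch

# Illustrative only: running estimate of the derivative of the empirical
# characteristic function, d/d(omega) E[exp(i*omega*x)] = E[i*x*exp(i*omega*x)],
# kept as a fixed-size complex state with an O(1)-per-token update.
# The real SCA layer is more involved than this.
omegas = torch.linspace(0.1, 3.0, 8)                     # fixed probe frequencies
state = torch.zeros(len(omegas), dtype=torch.complex64)

xs = torch.randn(16)                                     # stand-in scalar feature per token
for t, x in enumerate(xs, start=1):
    state = state + 1j * x * torch.exp(1j * omegas * x)  # constant work per step
    phi_prime = state / t                                 # current estimate at each frequency

# During training the same cumulative sums can be formed in parallel with a
# prefix scan (e.g. torch.cumsum over the token axis) instead of this loop.
```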

**Transformer layers** (every third layer) use standard causal self-attention with RoPE and GQA (16 heads, 4 KV heads).
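
For the GQA configuration, 16 query heads share 4 key/value heads, so each KV head serves 4 query heads. A minimal shape-level sketch (illustrative; RoPE application is omitted):

```python
import torch
import torch.nn.functional as F

# Grouped-query attention shapes for the transformer layers: 16 query heads,
# 4 KV heads, head_dim = 1024 / 16 = 64. Sequence length shortened for the demo
# (the model's context length is 4096).
batch, seq, d_model = 1, 128, 1024
n_heads, n_kv_heads = 16, 4
head_dim = d_model // n_heads

q = torch.randn(batch, n_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Each KV head is reused by n_heads // n_kv_heads = 4 query heads.
k = k.repeat_interleave(n_heads // n_kv_heads, dim=1)
v = v.repeat_interleave(n_heads // n_kv_heads, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 16, 128, 64])
```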

---

## Intended use

Nautile-370M is designed for **language understanding, common-sense reasoning, and classification tasks**:

- Sentiment analysis, intent detection, topic labeling
- Structured information extraction
- Fine-tuning for domain-specific classification
- Large-scale opinion modeling (thousands of instances in parallel on modest hardware)

It is **not designed** for open-ended multi-turn chat, code generation, or knowledge-intensive QA: those tasks benefit more from scale than from architectural efficiency at this parameter count.
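
A minimal sketch of the classification-style usage above, reusing the loading and generation API from the "Quick start" section below (the prompt wording and label set are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "trickstr-ai/nautile-370m", trust_remote_code=True, dtype=torch.bfloat16
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained("trickstr-ai/nautile-370m", trust_remote_code=True)

reviews = [
    "The battery lasts all week, I love it.",
    "Screen cracked after two days, avoid.",
]

for text in reviews:
    prompt = f"Classify the sentiment of this review as positive or negative: {text}"
    input_ids = torch.tensor([tokenizer.encode_chat(prompt)]).cuda()
    output = model.generate(input_ids, max_new_tokens=64, temperature=0.15, top_p=0.9)
    label = tokenizer.decode(output[0, input_ids.shape[1]:].tolist())
    print(text, "->", label.strip())
```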

---

## Benchmarks

0-shot evaluation against models of similar size:

<p align="center">
<img src="assets/benchmark.png" alt="Benchmark comparison" width="100%" />
</p>

<table>
<thead>
<tr style="border-bottom: 2px solid #e6ff55; text-transform: uppercase; letter-spacing: 0.04em; font-size: 0.85em;">
<th align="left">Benchmark</th>
<th align="center" style="color:#e6ff55;">Nautile-370M</th>
<th align="center">Qwen2.5-0.5B</th>
<th align="center">Granite-350M</th>
<th align="center">LFM2.5-350M</th>
<th align="center">SmolLM2-360M</th>
</tr>
</thead>
<tbody>
<tr><td>Training tokens</td><td align="center" style="border-left: 3px solid #e6ff55;">~0.8T</td><td align="center">18T</td><td align="center">10–12T</td><td align="center">28T</td><td align="center">4T</td></tr>
<tr><td>OpenBookQA</td><td align="center" style="border-left: 3px solid #e6ff55;"><strong>49.3</strong></td><td align="center">34.4</td><td align="center">31.6</td><td align="center">26.4</td><td align="center">24.2</td></tr>
<tr><td>ARC</td><td align="center" style="border-left: 3px solid #e6ff55;"><strong>57.0</strong></td><td align="center">50.1</td><td align="center">30.0</td><td align="center">32.8</td><td align="center">43.7</td></tr>
<tr><td>CommonsenseQA</td><td align="center" style="border-left: 3px solid #e6ff55;"><strong>46.8</strong></td><td align="center">46.5</td><td align="center">36.2</td><td align="center">44.3</td><td align="center">18.4</td></tr>
<tr><td>GSM8K</td><td align="center" style="border-left: 3px solid #e6ff55;"><strong>33.4</strong></td><td align="center">28.3</td><td align="center">31.5</td><td align="center">33.0</td><td align="center">7.4</td></tr>
<tr><td>PIQA</td><td align="center" style="border-left: 3px solid #e6ff55;"><strong>61.5</strong></td><td align="center">61.3</td><td align="center">50.8</td><td align="center">49.5</td><td align="center">48.2</td></tr>
<tr><td>IFEval</td><td align="center" style="border-left: 3px solid #e6ff55;">36.9</td><td align="center">31.6</td><td align="center">55.4</td><td align="center"><strong>62.4</strong></td><td align="center">41.0</td></tr>
<tr><td>TriviaQA</td><td align="center" style="border-left: 3px solid #e6ff55;">23.8</td><td align="center">27.8</td><td align="center">25.2</td><td align="center">22.9</td><td align="center"><strong>28.0</strong></td></tr>
<tr><td>MATH500</td><td align="center" style="border-left: 3px solid #e6ff55;">2.4</td><td align="center"><strong>18.8</strong></td><td align="center">5.6</td><td align="center">12.2</td><td align="center">0.0</td></tr>
<tr><td>MMLU-Pro</td><td align="center" style="border-left: 3px solid #e6ff55;">14.9</td><td align="center">14.3</td><td align="center">11.2</td><td align="center"><strong>18.6</strong></td><td align="center">10.3</td></tr>
<tr><td>MMLU</td><td align="center" style="border-left: 3px solid #e6ff55;"><strong>39.2</strong></td><td align="center">33.7</td><td align="center">35.0</td><td align="center">39.1</td><td align="center">35.8</td></tr>
<tr><td>GPQA Diamond</td><td align="center" style="border-left: 3px solid #e6ff55;"><strong>27.3</strong></td><td align="center">10.1</td><td align="center">26.3</td><td align="center">24.8</td><td align="center">23.2</td></tr>
<tr style="border-top: 1px solid #e6ff55;">
<td><strong>Average</strong></td>
<td align="center" style="border-left: 3px solid #e6ff55; color:#e6ff55;"><strong>35.7</strong></td>
<td align="center">32.4</td>
<td align="center">30.8</td>
<td align="center">33.3</td>
<td align="center">25.5</td>
</tr>
</tbody>
</table>

<p align="center">
<img src="assets/score_vs_tokens.png" alt="Average benchmark score versus training tokens" width="100%" />
</p>

All scores are accuracy (%), 0-shot. Evaluation is strict: responses with multiple candidate answers are scored as incorrect. Nautile-370M reaches these numbers with ~0.8T training tokens, compared to 4–28T for the other models in this table.
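
As a concrete reading of the strict rule, a response that names more than one answer choice earns no credit. A minimal scorer in that spirit (illustrative, not the exact evaluation harness):

```python
import re

def strict_score(response: str, gold: str, choices=("A", "B", "C", "D")) -> int:
    """Return 1 only if exactly one choice letter appears and it matches the gold answer."""
    found = {c for c in choices if re.search(rf"\b{c}\b", response)}
    return int(found == {gold})

print(strict_score("The answer is B.", "B"))     # 1
print(strict_score("It could be B or C.", "B"))  # 0: multiple candidates -> incorrect
```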

---

## Generation quality (LLM-as-a-judge)

We evaluate generation quality with an LLM-as-a-judge setup (GPT-4.1). For a diverse set of prompts covering factual knowledge, commonsense reasoning, instruction following, explanatory writing, creative writing, and short analytical responses, the judge compares Nautile-370M's output against a reference model's output and selects the better answer on overall quality, emphasizing correctness, faithfulness to the prompt, clarity, coherence, and absence of hallucination.
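
A minimal sketch of the pairwise judging call, assuming the OpenAI Python client; the judge prompt here is an illustration, not necessarily the exact one used for the numbers below:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4.1 which answer is better; returns 'A', 'B', or 'TIE'."""
    judge_prompt = (
        "Compare the two answers to the prompt below on correctness, faithfulness "
        "to the prompt, clarity, coherence, and absence of hallucination. "
        "Reply with exactly one token: A, B, or TIE.\n\n"
        f"Prompt: {prompt}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
    )
    reply = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return reply.choices[0].message.content.strip()
```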

<p align="center">
<img src="assets/llm_judge.png" alt="LLM-as-a-judge win rate comparison" width="100%" />
</p>

<table>
<thead>
<tr style="border-bottom: 2px solid #e6ff55; text-transform: uppercase; letter-spacing: 0.04em; font-size: 0.85em;">
<th align="left">Comparison</th>
<th align="center" style="color:#e6ff55;">Nautile-370M wins</th>
<th align="center">Reference wins</th>
<th align="center">Tie</th>
</tr>
</thead>
<tbody>
<tr>
<td>vs LFM2.5-350M</td>
<td align="center" style="border-left: 3px solid #e6ff55;"><strong>57%</strong></td>
<td align="center">42%</td>
<td align="center">1%</td>
</tr>
<tr>
<td>vs Granite-350M</td>
<td align="center" style="border-left: 3px solid #e6ff55;"><strong>63%</strong></td>
<td align="center">35%</td>
<td align="center">2%</td>
</tr>
<tr>
<td>vs Qwen2.5-0.5B</td>
<td align="center" style="border-left: 3px solid #e6ff55;"><strong>74%</strong></td>
<td align="center">22%</td>
<td align="center">4%</td>
</tr>
<tr>
<td>vs SmolLM2-360M</td>
<td align="center" style="border-left: 3px solid #e6ff55;"><strong>63%</strong></td>
<td align="center">36%</td>
<td align="center">1%</td>
</tr>
</tbody>
</table>

---

## Quick start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "trickstr-ai/nautile-370m",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(
    "trickstr-ai/nautile-370m",
    trust_remote_code=True,
)

# encode_chat() wraps your prompt in the think-then-answer template.
input_ids = torch.tensor([tokenizer.encode_chat("What is rapamycin?")]).cuda()

# acceleration="auto" (default): uses CUDA graphs on GPU, adds Triton
# kernels automatically if the triton package is installed.
# CUDA graphs are captured on the first generate() call (~2 s overhead once).
output = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.15,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    # acceleration="auto",        # default: cuda_graph + triton if available
    # acceleration="cuda_graph",  # CUDA graph only
    # acceleration="none",        # plain PyTorch, no graph capture
)
print(tokenizer.decode(output[0, input_ids.shape[1]:].tolist()))
```

---

## Chat template

The model uses a **ChatML** format with a chain-of-thought section delimited by `<|think_start|>` / `<|think_end|>`:

```
<|im_start|>user
{prompt}
<|im_end|><|im_start|>assistant
<|think_start|>{chain of thought}<|think_end|>
{answer}
<|im_end|>
```

| | Token | ID | |
| |---|---| |
| | `<\|im_start\|>` | 100278 | |
| | `<\|im_end\|>` | 100279 | |
| | `<\|think_start\|>` | 100280 | |
| | `<\|think_end\|>` | 100281 | |

You can also use `apply_chat_template` for multi-turn conversations:

```python
# Returns token ids ready to feed to generate() (tokenize=True is the default).
messages = [{"role": "user", "content": "What is rapamycin?"}]
ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
```

Recommended generation parameters: `temperature=0.15`, `top_p=0.9`, `top_k=50`, `repetition_penalty=1.1`.
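
If you only want the final answer, the chain-of-thought block can be stripped from the decoded text with a small helper (not part of the shipped tokenizer):

```python
import re

def strip_thinking(text: str) -> str:
    """Drop the <|think_start|>...<|think_end|> block and any trailing <|im_end|> marker."""
    text = re.sub(r"<\|think_start\|>.*?<\|think_end\|>", "", text, flags=re.DOTALL)
    return text.replace("<|im_end|>", "").strip()

print(strip_thinking("<|think_start|>recall what rapamycin is<|think_end|>\nRapamycin is an mTOR inhibitor.<|im_end|>"))
```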

---

## Inference speed

All measurements are **out-of-the-box via the Hugging Face `transformers` library** only, not vLLM, TensorRT-LLM, or any other specialized serving stack. Batch size 1, bfloat16, single GPU.

<p align="center">
<img src="assets/tokens_per_second.png" alt="Out-of-the-box Hugging Face Transformers inference speed comparison" width="100%" />
</p>

<table>
<thead>
<tr style="border-bottom: 2px solid #e6ff55; text-transform: uppercase; letter-spacing: 0.04em; font-size: 0.85em;">
<th align="left">Model</th>
<th align="center">tok/s</th>
</tr>
</thead>
<tbody>
<tr><td style="border-left: 3px solid #e6ff55; color:#e6ff55;"><strong>Nautile-370M</strong> (Triton kernel)</td><td align="center" style="color:#e6ff55;"><strong>125.9</strong></td></tr>
<tr><td style="border-left: 3px solid #e6ff55; color:#e6ff55;"><strong>Nautile-370M</strong></td><td align="center" style="color:#e6ff55;"><strong>108.3</strong></td></tr>
<tr><td>LFM2.5-350M</td><td align="center">72.9</td></tr>
<tr><td>Qwen2.5-0.5B</td><td align="center">44.5</td></tr>
<tr><td>SmolLM2-360M</td><td align="center">33.8</td></tr>
<tr><td>Baguettotron</td><td align="center">14.4</td></tr>
</tbody>
</table>

`acceleration="auto"` (default) enables CUDA graphs and Triton kernels automatically when available. CUDA graphs are captured once on the first `generate()` call (~2 s overhead) and reused for all subsequent calls.
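
The tok/s figures can be reproduced in the same spirit with a wall-clock loop around `generate()` (a sketch, assuming the model and tokenizer are loaded as in the quick start; the warm-up call keeps the one-time CUDA-graph capture out of the measurement):

```python
import time
import torch

prompt_ids = torch.tensor([tokenizer.encode_chat("What is rapamycin?")]).cuda()

model.generate(prompt_ids, max_new_tokens=32)        # warm-up: triggers graph capture

torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(prompt_ids, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - prompt_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tok/s")
```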

---

## Training

**Stage 1: Pretraining** (~350B tokens, TPU v4-64):
FineWeb-Edu for broad factual and linguistic coverage.

**Stage 2: Supervised fine-tuning** (~250B tokens, TPU v4-64):
SYNTH corpus of chain-of-thought reasoning traces, plus ~4M synthetic documents distilled from GPT-OSS-20B, GPT-OSS-120B, Mistral Small 3.2, and Mistral Large 3 with retrieval-guided format alignment.

**Stage 3: Reinforcement learning** (DGX Spark):
Three-stage RL pipeline:
1. *Dr. GRPO* with LLM-judge rewards for format alignment
2. *Gradient-balanced GRPO*, which decouples positive and negative gradient components to prevent instability at low success rates; +2.4 pp GSM8K
3. *Scored self-distillation*: fine-tuning on the model's own verified correct reasoning traces with an advantage-weighted loss (sketched after the table below); +2.1 pp GSM8K

GSM8K accuracy after each stage:

| | Stage | GSM8K | |
| |---|---| |
| | After SFT | 27.98% | |
| | + Dr. GRPO | 28.96% | |
| | + Gradient-balanced GRPO | 31.36% | |
| | + Scored self-distillation | **33.43%** | |
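
Stage 3's advantage-weighted objective can be sketched as a per-trace weight on the ordinary token-level cross-entropy (illustrative; the exact weighting scheme is described in the technical report):

```python
import torch
import torch.nn.functional as F

def advantage_weighted_loss(logits, labels, advantages):
    """Cross-entropy over each verified correct trace, scaled by its advantage.

    logits:     [batch, seq, vocab]  model outputs on the model's own traces
    labels:     [batch, seq]         target token ids, -100 on prompt/padding positions
    advantages: [batch]              non-negative per-trace advantage scores
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
    )                                                        # [batch, seq]
    mask = (labels != -100).float()
    per_trace = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return (advantages * per_trace).mean()
```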

---

## `trust_remote_code=True`

This model ships with custom modeling and tokenizer code. Pass `trust_remote_code=True` to `from_pretrained` calls. The relevant files (`modeling_seqcond.py`, `tokenization_seqcond.py`, `configuration_seqcond.py`) are included in this repository.

---

## Citation

```bibtex
@article{chenebaux2025nautile,
  title   = {Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model},
  author  = {Chenebaux, Maixent},
  journal = {arXiv preprint arXiv:2604.24809},
  year    = {2025}
}
```