# Nautile-370M
Nautile-370M is a 371M-parameter hybrid language model for reasoning and language understanding.
Its backbone alternates two SeqCond Attention (SCA) layers (a spectral sequence operator grounded in the derivative of the empirical characteristic function) with one standard transformer layer, giving a 2:1 SCA/Transformer ratio across 24 layers.
The model was pretrained on ~0.8T tokens on a single TPU v4-64 pod slice (Google TRC program), then post-trained with reinforcement learning on a single NVIDIA DGX Spark.
A technical report is available on arXiv: 2604.24809.
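For reference, the empirical characteristic function of samples $x_1, \dots, x_n$ and its derivative with respect to the frequency $\omega$ are the standard quantities below; how SCA parameterizes and uses this operator is detailed in the technical report, so these are only the textbook definitions.

$$
\hat{\varphi}_n(\omega) = \frac{1}{n}\sum_{k=1}^{n} e^{i\,\omega^\top x_k},
\qquad
\nabla_\omega \hat{\varphi}_n(\omega) = \frac{i}{n}\sum_{k=1}^{n} x_k\, e^{i\,\omega^\top x_k}.
$$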
## Architecture

The backbone repeats the pattern SCA → SCA → Transformer eight times (24 layers total).
| Parameter | Value |
|---|---|
| Parameters | 371M |
| Layers | 24 (16 SCA + 8 Transformer) |
| Model dimension | 1024 |
| FF dimension | 2730 |
| Context length | 4096 |
| Tokenizer | cl100k_base (tiktoken) |
| Weight tying | Yes |
| Dtype | bfloat16 |
SCA layers maintain a fixed-size complex state updated in O(1) per token at inference (parallel prefix scan during training). They are theoretically expressive enough to reproduce any softmax attention output as a special case.
Transformer layers (every 3rd layer) use standard causal self-attention with RoPE and GQA (16 heads, 4 KV heads).
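As intuition only (this is not the actual SCA operator; see the technical report for the real parameterization), a fixed-size complex state updated in O(1) per token looks like a diagonal complex recurrence:

```python
import torch

# Toy illustration of a fixed-size complex state with O(1) work per token
# (NOT the real SCA layer; decay values here are arbitrary).
d_state = 64
decay = torch.exp(torch.complex(
    torch.full((d_state,), -0.05),          # magnitude < 1 -> fading memory
    torch.linspace(0.1, 3.0, d_state),      # per-channel rotation frequency
))

state = torch.zeros(d_state, dtype=torch.cfloat)
for x_t in torch.randn(16, d_state):        # 16 toy "tokens"
    # One constant-cost update per token; during training the same recurrence
    # can instead be evaluated with a parallel (associative) prefix scan.
    state = decay * state + x_t.to(torch.cfloat)

readout = state.real                         # real-valued feature for the next layer
```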
## Intended use
Nautile-370M is designed for language understanding, common-sense reasoning, and classification tasks:
- Sentiment analysis, intent detection, topic labeling
- Structured information extraction
- Fine-tuning for domain-specific classification
- Large-scale opinion modeling (thousands of instances in parallel on modest hardware)
It is not designed for open-ended multi-turn chat, code generation, or knowledge-intensive QA: tasks that benefit more from scale than from architectural efficiency at this parameter count.
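For classification-style use, a zero-shot label can be obtained by prompting through the chat template. A minimal sketch, using the model and tokenizer loaded as in the Quick start section below (the prompt wording is illustrative):

```python
import torch

# Assumes `model` and `tokenizer` are loaded as shown in Quick start.
prompt = ('Label the sentiment of this review as "positive" or "negative".\n'
          'Review: "The battery died after two days."')
input_ids = torch.tensor([tokenizer.encode_chat(prompt)]).cuda()
output = model.generate(input_ids, max_new_tokens=64, temperature=0.15,
                        top_p=0.9, top_k=50, repetition_penalty=1.1)
print(tokenizer.decode(output[0, input_ids.shape[1]:].tolist()))  # e.g. "negative"
```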
## Benchmarks
0-shot evaluation against models of similar size:
| Benchmark | Nautile-370M | Qwen2.5-0.5B | Granite-350M | LFM2.5-350M | SmolLM2-360M |
|---|---|---|---|---|---|
| Training tokens | ~0.8T | 18T | 10–12T | 28T | 4T |
| OpenBookQA | 49.3 | 34.4 | 31.6 | 26.4 | 24.2 |
| ARC | 57.0 | 50.1 | 30.0 | 32.8 | 43.7 |
| CommonsenseQA | 46.8 | 46.5 | 36.2 | 44.3 | 18.4 |
| GSM8K | 33.4 | 28.3 | 31.5 | 33.0 | 7.4 |
| PIQA | 61.5 | 61.3 | 50.8 | 49.5 | 48.2 |
| IFEval | 36.9 | 31.6 | 55.4 | 62.4 | 41.0 |
| TriviaQA | 23.8 | 27.8 | 25.2 | 22.9 | 28.0 |
| MATH500 | 2.4 | 18.8 | 5.6 | 12.2 | 0.0 |
| MMLU-Pro | 14.9 | 14.3 | 11.2 | 18.6 | 10.3 |
| MMLU | 39.2 | 33.7 | 35.0 | 39.1 | 35.8 |
| GPQA Diamond | 27.3 | 10.1 | 26.3 | 24.8 | 23.2 |
| Average | 35.7 | 32.4 | 30.8 | 33.3 | 25.5 |
All scores are accuracy (%), 0-shot. Evaluation is strict: responses with multiple candidate answers are scored as incorrect. Nautile-370M reaches these numbers with ~0.8T training tokens, compared to 10–28T for the other models in this table.
## Generation quality (LLM-as-a-judge)
We evaluate generation quality using an LLM-as-a-judge setup (GPT-4.1). For a diverse set of prompts covering factual knowledge, commonsense reasoning, instruction following, explanatory writing, creative writing, and short analytical responses, the judge compares Nautile-370M's output against a reference model's output and selects the better answer based on overall quality. The evaluation emphasizes correctness, faithfulness to the prompt, clarity, coherence, and non-hallucination.
| Comparison | Nautile-370M wins | Reference wins | Tie |
|---|---|---|---|
| vs LFM2.5-350M | 57% | 42% | 1% |
| vs Granite-350M | 63% | 35% | 2% |
| vs Qwen2.5-0.5B | 74% | 22% | 4% |
| vs SmolLM2-360M | 63% | 36% | 1% |
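For reference, one pairwise judge call could look roughly like the sketch below; the exact judging prompt and harness are not published, and the OpenAI client usage here is an assumption.

```python
from openai import OpenAI

client = OpenAI()

def judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Ask GPT-4.1 to pick the better of two answers; returns 'A', 'B', or 'tie'."""
    instructions = (
        "You are comparing two answers to the same prompt.\n"
        f"Prompt: {prompt}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}\n\n"
        "Judge correctness, faithfulness to the prompt, clarity, coherence, and "
        "absence of hallucination. Reply with exactly 'A', 'B', or 'tie'."
    )
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": instructions}],
    )
    return resp.choices[0].message.content.strip()
```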
## Quick start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "trickstr-ai/nautile-370m",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).cuda().eval()

tokenizer = AutoTokenizer.from_pretrained(
    "trickstr-ai/nautile-370m",
    trust_remote_code=True,
)

# encode_chat() wraps your prompt in the think-then-answer template.
input_ids = torch.tensor([tokenizer.encode_chat("What is rapamycin?")]).cuda()

# acceleration="auto" (default): uses CUDA graphs on GPU, adds Triton
# kernels automatically if the triton package is installed.
# CUDA graphs are captured on the first generate() call (~2 s overhead once).
output = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.15,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    # acceleration="auto",        # default: cuda_graph + triton if available
    # acceleration="cuda_graph",  # CUDA graph only
    # acceleration="none",        # plain PyTorch, no graph capture
)

print(tokenizer.decode(output[0, input_ids.shape[1]:].tolist()))
```
## Chat template

The model uses a ChatML format with a chain-of-thought section delimited by `<|think_start|>` / `<|think_end|>`:

```
<|im_start|>user
{prompt}
<|im_end|><|im_start|>assistant
<|think_start|>{chain of thought}<|think_end|>
{answer}
<|im_end|>
```
| Token | ID |
|---|---|
| `<\|im_start\|>` | 100278 |
| `<\|im_end\|>` | 100279 |
| `<\|think_start\|>` | 100280 |
| `<\|think_end\|>` | 100281 |
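Assuming the decoded text still contains these special tokens, a small helper can split the chain-of-thought from the final answer (a sketch based on the delimiters above, not part of the shipped tokenizer):

```python
def split_think(completion: str) -> tuple[str, str]:
    """Split a decoded completion into (chain_of_thought, answer)."""
    thought, answer = "", completion
    if "<|think_end|>" in completion:
        thought, _, answer = completion.partition("<|think_end|>")
        thought = thought.replace("<|think_start|>", "").strip()
    return thought, answer.replace("<|im_end|>", "").strip()
```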
You can also use `apply_chat_template` for multi-turn conversations:

```python
messages = [{"role": "user", "content": "What is rapamycin?"}]
ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
```

Recommended generation parameters: `temperature=0.15`, `top_p=0.9`, `top_k=50`, `repetition_penalty=1.1`.
## Inference speed

All measurements are out-of-the-box via the Hugging Face transformers library only (not vLLM, TensorRT-LLM, or any other specialized serving stack). Batch size 1, bfloat16, single GPU.
| Model | tok/s |
|---|---|
| Nautile-370M (Triton kernel) | 125.9 |
| Nautile-370M | 108.3 |
| LFM2.5-350M | 72.9 |
| Qwen2.5-0.5B | 44.5 |
| SmolLM2-360M | 33.8 |
| Baguettotron | 14.4 |
`acceleration="auto"` (default) enables CUDA graphs and Triton kernels automatically when available. CUDA graphs are captured once on the first `generate()` call (~2 s overhead) and reused for all subsequent calls.
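To reproduce a rough tok/s figure on your own hardware, a sketch (batch size 1, model and tokenizer loaded as in Quick start; absolute numbers depend on the GPU):

```python
import time
import torch

input_ids = torch.tensor([tokenizer.encode_chat("Explain photosynthesis briefly.")]).cuda()

# Warm-up: the first generate() call captures the CUDA graphs (~2 s one-time cost).
model.generate(input_ids, max_new_tokens=32)

torch.cuda.synchronize()
start = time.perf_counter()
output = model.generate(input_ids, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - input_ids.shape[1]
print(f"{new_tokens / elapsed:.1f} tok/s")
```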
## Training

Stage 1 - Pretraining (~350B tokens, TPU v4-64): FineWeb-Edu for broad factual and linguistic coverage.

Stage 2 - Supervised fine-tuning (~250B tokens, TPU v4-64): SYNTH corpus of chain-of-thought reasoning traces, plus ~4M synthetic documents distilled from GPT-OSS-20B, GPT-OSS-120B, Mistral Small 3.2, and Mistral Large 3 with retrieval-guided format alignment.

Stage 3 - Reinforcement learning (DGX Spark): three-stage RL pipeline:
- Dr. GRPO with LLM-judge rewards for format alignment
- Gradient-balanced GRPO: decouples positive/negative gradient components to prevent low-success-rate instability; +2.4 pp GSM8K
- Scored self-distillation: fine-tuning on the model's own verified correct reasoning traces with advantage-weighted loss; +2.1 pp GSM8K
| Stage | GSM8K |
|---|---|
| After SFT | 27.98% |
| + Dr. GRPO | 28.96% |
| + Gradient-balanced GRPO | 31.36% |
| + Scored self-distillation | 33.43% |
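As intuition for the last stage, an advantage-weighted loss over the model's own verified-correct traces could be written roughly as below; this is a sketch, not the released training code, and the exact weighting scheme is an assumption.

```python
import torch
import torch.nn.functional as F

def advantage_weighted_distill_loss(logits, labels, advantages, ignore_index=-100):
    """Token-level cross-entropy on self-generated traces, with each sequence's
    loss scaled by its (non-negative) advantage.
    logits: [B, T, V], labels: [B, T], advantages: [B]."""
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels,       # [B, V, T] vs [B, T]
        ignore_index=ignore_index, reduction="none",
    )                                         # -> [B, T]
    mask = (labels != ignore_index).float()
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return (advantages * per_seq).mean()
```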
## trust_remote_code=True

This model ships with custom modeling and tokenizer code. Pass `trust_remote_code=True` to `from_pretrained` calls. The relevant files (`modeling_seqcond.py`, `tokenization_seqcond.py`, `configuration_seqcond.py`) are included in this repository.
## Citation

```bibtex
@article{chenebaux2025nautile,
  title  = {Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model},
  author = {Chenebaux, Maixent},
  year   = {2025},
  note   = {arXiv preprint, link coming soon}
}
```