Use with the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="trickstr-ai/nautile-370m", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("trickstr-ai/nautile-370m", trust_remote_code=True, dtype="auto")

Nautile-370M


Nautile-370M is a 371M-parameter hybrid language model for reasoning and language understanding.

Its backbone alternates two SeqCond Attention (SCA) layers (a spectral sequence operator grounded in the derivative of the empirical characteristic function) with one standard transformer layer, giving a 2:1 SCA/Transformer ratio across 24 layers.

The model was pretrained on ~0.8T tokens on a single TPU v4-64 pod slice (Google TRC program), then post-trained with reinforcement learning on a single NVIDIA DGX Spark.

A technical report is available on arXiv: 2604.24809.


Architecture

The backbone repeats the pattern SCA → SCA → Transformer eight times (24 layers total).

Parameters       371M
Layers           24 (16 SCA + 8 Transformer)
Model dimension  1024
FF dimension     2730
Context length   4096
Tokenizer        cl100k_base (tiktoken)
Weight tying     Yes
Dtype            bfloat16

SCA layers maintain a fixed-size complex state updated in O(1) per token at inference (parallel prefix scan during training). They are theoretically expressive enough to reproduce any softmax attention output as a special case.

Transformer layers (every 3rd layer) use standard causal self-attention with RoPE and GQA (16 heads, 4 KV heads).
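
The 2:1 interleaving is easy to write down explicitly. The sketch below only illustrates the layer pattern and head configuration listed above; the names and config dictionaries are placeholders, not the repository's actual modules.

# Illustrative only: reproduces the SCA -> SCA -> Transformer pattern described above.
N_BLOCKS = 8                      # pattern repeated 8 times -> 24 layers
D_MODEL, D_FF = 1024, 2730
N_HEADS, N_KV_HEADS = 16, 4       # GQA in the transformer layers

def build_backbone():
    layers = []
    for _ in range(N_BLOCKS):
        layers.append(("sca", {"d_model": D_MODEL, "d_ff": D_FF}))
        layers.append(("sca", {"d_model": D_MODEL, "d_ff": D_FF}))
        layers.append(("transformer", {"d_model": D_MODEL, "d_ff": D_FF,
                                       "n_heads": N_HEADS, "n_kv_heads": N_KV_HEADS}))
    return layers

backbone = build_backbone()
assert len(backbone) == 24
assert sum(kind == "sca" for kind, _ in backbone) == 16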


Intended use

Nautile-370M is designed for language understanding, common-sense reasoning, and classification tasks:

  • Sentiment analysis, intent detection, topic labeling
  • Structured information extraction
  • Fine-tuning for domain-specific classification
  • Large-scale opinion modeling (thousands of instances in parallel on modest hardware; see the batching sketch below)

It is not designed for open-ended multi-turn chat, code generation, or knowledge-intensive QA, tasks that benefit more from scale than from architectural efficiency at this parameter count.
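
For the classification use cases above, the pipeline from the top of this card can be run over many inputs at once. The snippet below is a minimal sketch; the prompt wording, label parsing, and batch size are illustrative choices, not a prescribed recipe.

from transformers import pipeline

pipe = pipeline("text-generation", model="trickstr-ai/nautile-370m", trust_remote_code=True)

reviews = [
    "The battery lasts two full days, I'm impressed.",
    "Screen cracked after a week and support never answered.",
]
prompts = [
    f"Classify the sentiment of this review as positive or negative.\nReview: {r}\nSentiment:"
    for r in reviews
]
# batch_size groups prompts into a single forward pass; if the tokenizer defines
# no pad token, set pipe.tokenizer.pad_token_id before batching.
outputs = pipe(prompts, max_new_tokens=4, batch_size=8, return_full_text=False)
labels = [o[0]["generated_text"].strip().lower() for o in outputs]
print(list(zip(reviews, labels)))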


Benchmarks

0-shot evaluation against models of similar size:

Benchmark comparison

Benchmark        Nautile-370M   Qwen2.5-0.5B   Granite-350M   LFM2.5-350M   SmolLM2-360M
Training tokens  ~0.8T          18T            10–12T         28T           4T
OpenBookQA       49.3           34.4           31.6           26.4          24.2
ARC              57.0           50.1           30.0           32.8          43.7
CommonsenseQA    46.8           46.5           36.2           44.3          18.4
GSM8K            33.4           28.3           31.5           33.0          7.4
PIQA             61.5           61.3           50.8           49.5          48.2
IFEval           36.9           31.6           55.4           62.4          41.0
TriviaQA         23.8           27.8           25.2           22.9          28.0
MATH500          2.4            18.8           5.6            12.2          0.0
MMLU-Pro         14.9           14.3           11.2           18.6          10.3
MMLU             39.2           33.7           35.0           39.1          35.8
GPQA Diamond     27.3           10.1           26.3           24.8          23.2
Average          35.7           32.4           30.8           33.3          25.5

Average benchmark score versus training tokens

All scores are accuracy (%), 0-shot. Evaluation is strict: responses with multiple candidate answers are scored as incorrect. Nautile-370M reaches these numbers with ~0.8T training tokens, compared to 10–28T for the other models in this table.
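
A minimal sketch of that strict rule for multiple-choice answers (the regex and option set are illustrative, not the exact evaluation harness):

import re

def strict_mc_score(response: str, gold: str, options: str = "ABCD") -> int:
    # Collect every option letter the response commits to; anything other than
    # exactly one candidate scores 0 under the strict rule.
    picked = set(re.findall(rf"\b([{options}])\b", response))
    return int(len(picked) == 1 and picked.pop() == gold)

print(strict_mc_score("The answer is B.", "B"))            # 1
print(strict_mc_score("Either B or C could work.", "B"))   # 0: two candidates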


Generation quality β€” LLM-as-a-judge

We evaluate generation quality using an LLM-as-a-judge setup (GPT-4.1). For a diverse set of prompts covering factual knowledge, commonsense reasoning, instruction following, explanatory writing, creative writing, and short analytical responses, the judge compares Nautile-370M's output against a reference model's output and selects the better answer based on overall quality. The evaluation emphasizes correctness, faithfulness to the prompt, clarity, coherence, and non-hallucination.

LLM-as-a-judge win rate comparison

Comparison Nautile-370M wins Reference wins Tie
vs LFM2.5-350M 57% 42% 1%
vs Granite-350M 63% 35% 2%
vs Qwen2.5-0.5B 74% 22% 4%
vs SmolLM2-360M 63% 36% 1%
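
A minimal sketch of the pairwise judging step; the judge prompt wording and output labels are assumptions, not the exact protocol used for the table above.

from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

JUDGE_TEMPLATE = (
    "You are comparing two answers to the same prompt.\n"
    "Judge correctness, faithfulness to the prompt, clarity, coherence, and absence of hallucination.\n"
    "Reply with exactly one word: A, B, or TIE.\n\n"
    "Prompt:\n{prompt}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
)

def judge(prompt: str, answer_a: str, answer_b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            prompt=prompt, answer_a=answer_a, answer_b=answer_b)}],
    )
    return resp.choices[0].message.content.strip()  # "A", "B", or "TIE"

In practice a setup like this would also swap the A/B positions between runs to control for position bias.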

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "trickstr-ai/nautile-370m",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(
    "trickstr-ai/nautile-370m",
    trust_remote_code=True,
)

# encode_chat() wraps your prompt in the think-then-answer template.
input_ids = torch.tensor([tokenizer.encode_chat("What is rapamycin?")]).cuda()

# acceleration="auto" (default): uses CUDA graphs on GPU, adds Triton
# kernels automatically if the triton package is installed.
# CUDA graphs are captured on the first generate() call (~2 s overhead once).
output = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.15,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    # acceleration="auto",      # default β€” cuda_graph + triton if available
    # acceleration="cuda_graph", # cuda graph only
    # acceleration="none",       # plain PyTorch, no graph capture
)
print(tokenizer.decode(output[0, input_ids.shape[1]:].tolist()))

Chat template

The model uses a ChatML format with a chain-of-thought section delimited by <|think_start|> / <|think_end|>:

<|im_start|>user
{prompt}
<|im_end|><|im_start|>assistant
<|think_start|>{chain of thought}<|think_end|>
{answer}
<|im_end|>

Token            ID
<|im_start|>     100278
<|im_end|>       100279
<|think_start|>  100280
<|think_end|>    100281

You can also use apply_chat_template for multi-turn conversations:

messages = [{"role": "user", "content": "What is rapamycin?"}]
ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
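
From there the ids can be fed to generate() exactly as in the quick-start example (this assumes the model and tokenizer loaded above):

import torch

input_ids = torch.tensor([ids]).cuda()
output = model.generate(input_ids, max_new_tokens=512, temperature=0.15,
                        top_p=0.9, top_k=50, repetition_penalty=1.1)
print(tokenizer.decode(output[0, input_ids.shape[1]:].tolist()))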

Recommended generation parameters: temperature=0.15, top_p=0.9, top_k=50, repetition_penalty=1.1.


Inference speed

All measurements are out-of-the-box via the Hugging Face transformers library only (not vLLM, TensorRT-LLM, or any other specialized serving stack). Batch size 1, bfloat16, single GPU.

Out-of-the-box Hugging Face Transformers inference speed comparison

Model                          tok/s
Nautile-370M (Triton kernel)   125.9
Nautile-370M                   108.3
LFM2.5-350M                    72.9
Qwen2.5-0.5B                   44.5
SmolLM2-360M                   33.8
Baguettotron                   14.4

acceleration="auto" (default) enables CUDA graphs and Triton kernels automatically when available. CUDA graphs are captured once on the first generate() call (~2 s overhead) and reused for all subsequent calls.


Training

Stage 1 – Pretraining (~350B tokens, TPU v4-64): FineWeb-Edu for broad factual and linguistic coverage.

Stage 2 – Supervised fine-tuning (~250B tokens, TPU v4-64): SYNTH corpus of chain-of-thought reasoning traces, plus ~4M synthetic documents distilled from GPT-OSS-20B, GPT-OSS-120B, Mistral Small 3.2, and Mistral Large 3 with retrieval-guided format alignment.

Stage 3 – Reinforcement learning (DGX Spark): a three-stage RL pipeline:

  1. Dr. GRPO with LLM-judge rewards for format alignment
  2. Gradient-balanced GRPO: decouples positive and negative gradient components to prevent instability at low success rates; +2.4 pp on GSM8K
  3. Scored self-distillation: fine-tuning on the model's own verified correct reasoning traces with an advantage-weighted loss; +2.1 pp on GSM8K (a sketch of such a loss follows the table below)

Stage                        GSM8K
After SFT                    27.98%
+ Dr. GRPO                   28.96%
+ Gradient-balanced GRPO     31.36%
+ Scored self-distillation   33.43%
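
As a rough picture of the advantage-weighted loss used in the self-distillation stage, one reading is a per-trace weight applied to the usual token cross-entropy. The sketch below follows that reading only; it is not the exact objective from the technical report.

import torch
import torch.nn.functional as F

def advantage_weighted_loss(logits, target_ids, advantages):
    """logits: (batch, seq, vocab); target_ids: (batch, seq);
    advantages: (batch,) scores for the model's own verified-correct traces."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        reduction="none",
    ).reshape(target_ids.shape)                # (batch, seq)
    per_trace = per_token.mean(dim=1)           # (batch,)
    return (advantages * per_trace).mean()      # weight each trace by its advantage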

trust_remote_code=True

This model ships with custom modeling and tokenizer code. Pass trust_remote_code=True to from_pretrained calls. The relevant files (modeling_seqcond.py, tokenization_seqcond.py, configuration_seqcond.py) are included in this repository.


Citation

@article{chenebaux2025nautile,
  title   = {Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model},
  author  = {Chenebaux, Maixent},
  year    = {2025},
  note    = {arXiv preprint arXiv:2604.24809}
}