Use with the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="trickstr-ai/nautile-370m", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("trickstr-ai/nautile-370m", trust_remote_code=True, dtype="auto")

Nautile-370M


Nautile-370M is a 371M-parameter hybrid language model for reasoning and language understanding.

Its backbone alternates two SeqCond Attention (SCA) layers (a spectral sequence operator grounded in the derivative of the empirical characteristic function) with one standard transformer layer, giving a 2:1 SCA/Transformer ratio across 24 layers.

The model was pretrained on ~0.8T tokens on a single TPU v4-64 pod slice (Google TRC program), then post-trained with reinforcement learning on a single NVIDIA DGX Spark.

A technical report is available on arXiv: 2604.24809.


Architecture

The backbone repeats the pattern SCA → SCA → Transformer eight times (24 layers total).

Parameters       371M
Layers           24 (16 SCA + 8 Transformer)
Model dimension  1024
FF dimension     2730
Context length   4096
Tokenizer        cl100k_base (tiktoken)
Weight tying     Yes
Dtype            bfloat16

SCA layers maintain a fixed-size complex state updated in O(1) per token at inference (parallel prefix scan during training). They are theoretically expressive enough to reproduce any softmax attention output as a special case.

Transformer layers (every 3rd layer) use standard causal self-attention with RoPE and GQA (16 heads, 4 KV heads).
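
The 2:1 interleaving is easy to write down explicitly. The sketch below only illustrates the layer pattern and head configuration listed above; the names and config dictionaries are placeholders, not the repository's actual modules.

# Illustrative only: reproduces the SCA -> SCA -> Transformer pattern described above.
N_BLOCKS = 8                      # pattern repeated 8 times -> 24 layers
D_MODEL, D_FF = 1024, 2730
N_HEADS, N_KV_HEADS = 16, 4       # GQA in the transformer layers

def build_backbone():
    layers = []
    for _ in range(N_BLOCKS):
        layers.append(("sca", {"d_model": D_MODEL, "d_ff": D_FF}))
        layers.append(("sca", {"d_model": D_MODEL, "d_ff": D_FF}))
        layers.append(("transformer", {"d_model": D_MODEL, "d_ff": D_FF,
                                       "n_heads": N_HEADS, "n_kv_heads": N_KV_HEADS}))
    return layers

backbone = build_backbone()
assert len(backbone) == 24
assert sum(kind == "sca" for kind, _ in backbone) == 16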


Intended use

Nautile-370M is designed for language understanding, common-sense reasoning, and classification tasks:

  • Sentiment analysis, intent detection, topic labeling
  • Structured information extraction
  • Fine-tuning for domain-specific classification
  • Large-scale opinion modeling (thousands of instances in parallel on modest hardware; see the batching sketch below)

It is not designed for open-ended multi-turn chat, code generation, or knowledge-intensive QA, tasks that benefit more from scale than from architectural efficiency at this parameter count.
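
For the classification use cases above, the pipeline from the top of this card can be run over many inputs at once. The snippet below is a minimal sketch; the prompt wording, label parsing, and batch size are illustrative choices, not a prescribed recipe.

from transformers import pipeline

pipe = pipeline("text-generation", model="trickstr-ai/nautile-370m", trust_remote_code=True)

reviews = [
    "The battery lasts two full days, I'm impressed.",
    "Screen cracked after a week and support never answered.",
]
prompts = [
    f"Classify the sentiment of this review as positive or negative.\nReview: {r}\nSentiment:"
    for r in reviews
]
# batch_size groups prompts into a single forward pass; if the tokenizer defines
# no pad token, set pipe.tokenizer.pad_token_id before batching.
outputs = pipe(prompts, max_new_tokens=4, batch_size=8, return_full_text=False)
labels = [o[0]["generated_text"].strip().lower() for o in outputs]
print(list(zip(reviews, labels)))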


Benchmarks

0-shot evaluation against models of similar size:

Benchmark comparison

Benchmark        Nautile-370M   Qwen2.5-0.5B   Granite-350M   LFM2.5-350M   SmolLM2-360M
Training tokens  ~0.8T          18T            10–12T         28T           4T
OpenBookQA       49.3           34.4           31.6           26.4          24.2
ARC              57.0           50.1           30.0           32.8          43.7
CommonsenseQA    46.8           46.5           36.2           44.3          18.4
GSM8K            33.4           28.3           31.5           33.0          7.4
PIQA             61.5           61.3           50.8           49.5          48.2
IFEval           36.9           31.6           55.4           62.4          41.0
TriviaQA         23.8           27.8           25.2           22.9          28.0
MATH500          2.4            18.8           5.6            12.2          0.0
MMLU-Pro         14.9           14.3           11.2           18.6          10.3
MMLU             39.2           33.7           35.0           39.1          35.8
GPQA Diamond     27.3           10.1           26.3           24.8          23.2
Average          35.7           32.4           30.8           33.3          25.5

Average benchmark score versus training tokens

All scores are accuracy (%), 0-shot. Evaluation is strict: responses with multiple candidate answers are scored as incorrect. Nautile-370M reaches these numbers with ~0.8T training tokens, compared to 10–28T for the other models in this table.
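
A minimal sketch of that strict rule for multiple-choice answers (the regex and option set are illustrative, not the exact evaluation harness):

import re

def strict_mc_score(response: str, gold: str, options: str = "ABCD") -> int:
    # Collect every option letter the response commits to; anything other than
    # exactly one candidate scores 0 under the strict rule.
    picked = set(re.findall(rf"\b([{options}])\b", response))
    return int(len(picked) == 1 and picked.pop() == gold)

print(strict_mc_score("The answer is B.", "B"))            # 1
print(strict_mc_score("Either B or C could work.", "B"))   # 0: two candidates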


Generation quality β€” LLM-as-a-judge

We evaluate generation quality using an LLM-as-a-judge setup (GPT-4.1). For a diverse set of prompts covering factual knowledge, commonsense reasoning, instruction following, explanatory writing, creative writing, and short analytical responses, the judge compares Nautile-370M's output against a reference model's output and selects the better answer based on overall quality. The evaluation emphasizes correctness, faithfulness to the prompt, clarity, coherence, and non-hallucination.

LLM-as-a-judge win rate comparison

Comparison Nautile-370M wins Reference wins Tie
vs LFM2.5-350M 57% 42% 1%
vs Granite-350M 63% 35% 2%
vs Qwen2.5-0.5B 74% 22% 4%
vs SmolLM2-360M 63% 36% 1%
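
A minimal sketch of the pairwise judging step; the judge prompt wording and output labels are assumptions, not the exact protocol used for the table above.

from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

JUDGE_TEMPLATE = (
    "You are comparing two answers to the same prompt.\n"
    "Judge correctness, faithfulness to the prompt, clarity, coherence, and absence of hallucination.\n"
    "Reply with exactly one word: A, B, or TIE.\n\n"
    "Prompt:\n{prompt}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}"
)

def judge(prompt: str, answer_a: str, answer_b: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            prompt=prompt, answer_a=answer_a, answer_b=answer_b)}],
    )
    return resp.choices[0].message.content.strip()  # "A", "B", or "TIE"

In practice a setup like this would also swap the A/B positions between runs to control for position bias.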

Quick start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "trickstr-ai/nautile-370m",
    trust_remote_code=True,
    dtype=torch.bfloat16,
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(
    "trickstr-ai/nautile-370m",
    trust_remote_code=True,
)

# encode_chat() wraps your prompt in the think-then-answer template.
input_ids = torch.tensor([tokenizer.encode_chat("What is rapamycin?")]).cuda()

# acceleration="auto" (default): uses CUDA graphs on GPU, adds Triton
# kernels automatically if the triton package is installed.
# CUDA graphs are captured on the first generate() call (~2 s overhead once).
output = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.15,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
    # acceleration="auto",      # default β€” cuda_graph + triton if available
    # acceleration="cuda_graph", # cuda graph only
    # acceleration="none",       # plain PyTorch, no graph capture
)
print(tokenizer.decode(output[0, input_ids.shape[1]:].tolist()))

Chat template

The model uses a ChatML format with a chain-of-thought section delimited by <|think_start|> / <|think_end|>:

<|im_start|>user
{prompt}
<|im_end|><|im_start|>assistant
<|think_start|>{chain of thought}<|think_end|>
{answer}
<|im_end|>

Token            ID
<|im_start|>     100278
<|im_end|>       100279
<|think_start|>  100280
<|think_end|>    100281

You can also use apply_chat_template for multi-turn conversations:

messages = [{"role": "user", "content": "What is rapamycin?"}]
ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
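
From there the ids can be fed to generate() exactly as in the quick-start example (this assumes the model and tokenizer loaded above):

import torch

input_ids = torch.tensor([ids]).cuda()
output = model.generate(input_ids, max_new_tokens=512, temperature=0.15,
                        top_p=0.9, top_k=50, repetition_penalty=1.1)
print(tokenizer.decode(output[0, input_ids.shape[1]:].tolist()))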

Recommended generation parameters: temperature=0.15, top_p=0.9, top_k=50, repetition_penalty=1.1.


Inference speed

All measurements are out-of-the-box via the Hugging Face transformers library only (not vLLM, TensorRT-LLM, or any other specialized serving stack). Batch size 1, bfloat16, single GPU.

Out-of-the-box Hugging Face Transformers inference speed comparison

Model                          tok/s
Nautile-370M (Triton kernel)   125.9
Nautile-370M                   108.3
LFM2.5-350M                    72.9
Qwen2.5-0.5B                   44.5
SmolLM2-360M                   33.8
Baguettotron                   14.4

acceleration="auto" (default) enables CUDA graphs and Triton kernels automatically when available. CUDA graphs are captured once on the first generate() call (~2 s overhead) and reused for all subsequent calls.


Training

Stage 1 – Pretraining (~350B tokens, TPU v4-64): FineWeb-Edu for broad factual and linguistic coverage.

Stage 2 – Supervised fine-tuning (~250B tokens, TPU v4-64): SYNTH corpus of chain-of-thought reasoning traces, plus ~4M synthetic documents distilled from GPT-OSS-20B, GPT-OSS-120B, Mistral Small 3.2, and Mistral Large 3 with retrieval-guided format alignment.

Stage 3 – Reinforcement learning (DGX Spark): a three-stage RL pipeline:

  1. Dr. GRPO with LLM-judge rewards for format alignment
  2. Gradient-balanced GRPO: decouples positive and negative gradient components to prevent instability at low success rates; +2.4 pp on GSM8K
  3. Scored self-distillation: fine-tuning on the model's own verified correct reasoning traces with an advantage-weighted loss; +2.1 pp on GSM8K (a sketch of such a loss follows the table below)

Stage                        GSM8K
After SFT                    27.98%
+ Dr. GRPO                   28.96%
+ Gradient-balanced GRPO     31.36%
+ Scored self-distillation   33.43%
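
As a rough picture of the advantage-weighted loss used in the self-distillation stage, one reading is a per-trace weight applied to the usual token cross-entropy. The sketch below follows that reading only; it is not the exact objective from the technical report.

import torch
import torch.nn.functional as F

def advantage_weighted_loss(logits, target_ids, advantages):
    """logits: (batch, seq, vocab); target_ids: (batch, seq);
    advantages: (batch,) scores for the model's own verified-correct traces."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        reduction="none",
    ).reshape(target_ids.shape)                # (batch, seq)
    per_trace = per_token.mean(dim=1)           # (batch,)
    return (advantages * per_trace).mean()      # weight each trace by its advantage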

trust_remote_code=True

This model ships with custom modeling and tokenizer code. Pass trust_remote_code=True to from_pretrained calls. The relevant files (modeling_seqcond.py, tokenization_seqcond.py, configuration_seqcond.py) are included in this repository.


Citation

@article{chenebaux2025nautile,
  title   = {Nautile-370M: Spectral Memory Meets Attention in a Small Reasoning Model},
  author  = {Chenebaux, Maixent},
  year    = {2025},
  note    = {arXiv preprint arXiv:2604.24809}
}