# Noema-1
Noema-1 is a 2.86B parameter Mixture-of-Experts (MoE) Masked Diffusion Language Model specialized in mathematical reasoning. Unlike autoregressive models that generate one token at a time, Noema-1 generates entire blocks of tokens in parallel through iterative denoising, achieving significantly higher throughput.
## Key Features

- 2.86B total parameters, ~720M active per token: efficient MoE routing activates only 4 of 32 experts per layer
- Parallel token generation: generates 32 tokens simultaneously per denoising step
- Block-wise iterative refinement: each block undergoes confidence-thresholded denoising with optional token editing
- Two inference modes:
  - Q-Mode (Quality): `threshold=0.7`, `editing_threshold=0.5` for higher accuracy with token editing
  - S-Mode (Speed): `threshold=0.5`, `editing_threshold=0.0` for faster generation, suitable for most tasks
- Strong mathematical reasoning: trained on 600K curated math problems with chain-of-thought solutions
## Architecture
| Component | Value |
|---|---|
| Total Parameters | 2.86B |
| Active Parameters/Token | ~720M |
| Hidden Size | 2048 |
| Layers | 20 (1 dense + 19 MoE) |
| Attention Heads | 16 (4 KV heads, GQA) |
| Experts per Layer | 32 |
| Active Experts per Token | 4 |
| Expert Groups | 8 (4 experts selected per group) |
| Shared Experts | 1 per MoE layer |
| Expert FFN Size | 512 |
| Vocabulary | 157,184 tokens |
| Max Context | 16,384 tokens |
| Precision | bfloat16 |
## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "shouryamaanjain/Noema-1",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(
    "shouryamaanjain/Noema-1",
    trust_remote_code=True,
)

prompt = "Prove that the square root of 2 is irrational."
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        inputs=input_ids,
        gen_length=1024,
        block_length=32,
        threshold=0.7,          # Q-Mode (use 0.5 for S-Mode)
        editing_threshold=0.5,  # Q-Mode (use 0.0 for S-Mode)
        max_post_steps=16,
        temperature=0.0,
        eos_early_stop=True,
    )

response = tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
## Generation Parameters

| Parameter | Description | Q-Mode | S-Mode | Range |
|---|---|---|---|---|
| `threshold` | Confidence threshold for accepting denoised tokens | 0.7 | 0.5 | 0.0 - 1.0 |
| `editing_threshold` | Confidence threshold for editing already-committed tokens | 0.5 | 0.0 | 0.0 - 1.0 |
| `block_length` | Number of tokens generated in parallel per block | 32 | 32 | 16 - 128 |
| `gen_length` | Maximum tokens to generate | 1024 | 1024 | 1 - 4096 |
| `max_post_steps` | Refinement iterations after all masks are resolved | 16 | 16 | 0 - 32 |
| `temperature` | Sampling temperature (0.0 = greedy) | 0.0 | 0.0 | 0.0 - 2.0 |
| `top_k` | Top-k sampling (None = disabled) | None | None | 1 - 1000 |
| `top_p` | Nucleus sampling (None = disabled) | None | None | 0.0 - 1.0 |
| `steps` | Max denoising steps per block | 32 | 32 | 1 - 64 |
| `eos_early_stop` | Stop at first EOS token | True | True | bool |
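The two documented modes differ only in the first two parameters, so they can be captured as presets and splatted into `generate()`. The dictionary names below are illustrative; the model exposes no such constants:

```python
# Parameter presets for the two documented inference modes
# (names are illustrative, not part of the model's API).
Q_MODE = {"threshold": 0.7, "editing_threshold": 0.5}  # quality: enables token editing
S_MODE = {"threshold": 0.5, "editing_threshold": 0.0}  # speed: no editing pass

# Usage sketch:
# model.generate(inputs=input_ids, gen_length=1024, block_length=32, **S_MODE)
print(S_MODE)
```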
## How Generation Works

Noema-1 uses block-wise masked diffusion, a fundamentally different approach from autoregressive generation:

- Initialization: The output space is filled with `[MASK]` tokens
- Block iteration: Blocks of `block_length` tokens are processed left-to-right
- Parallel denoising: Within each block, all masked tokens are predicted simultaneously
- Confidence thresholding: Only tokens with confidence above `threshold` are committed; the rest remain masked for the next iteration
- Token editing (Q-Mode): Already-committed tokens can be revised if the model's new prediction has higher confidence
- Early stopping: Generation halts when an EOS token is produced

This enables parallel token generation: multiple tokens are decoded per forward pass, unlike autoregressive models, which produce exactly one token per pass.
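The confidence-thresholded denoising loop above can be sketched with a toy stand-in for the model. This is a minimal illustration of the control flow for one block (no token editing, and `fake_model` is a placeholder, not the real denoiser):

```python
import random

MASK = "<mask>"

def fake_model(block):
    """Stand-in for the real denoiser: one (token, confidence) pair per position."""
    return [(f"tok{i}", random.random()) for i, _ in enumerate(block)]

def denoise_block(length=32, threshold=0.7, max_steps=64):
    """Commit predictions whose confidence clears `threshold`; re-mask the rest."""
    block = [MASK] * length
    steps = 0
    while MASK in block and steps < max_steps:
        preds = fake_model(block)          # all positions predicted in parallel
        for i, (tok, conf) in enumerate(preds):
            if block[i] == MASK and conf >= threshold:
                block[i] = tok             # commit high-confidence prediction
        steps += 1
    return block, steps

random.seed(0)
block, steps = denoise_block()
print(f"resolved {len(block)} tokens in {steps} forward passes")
```

Because several positions typically clear the threshold in the same step, the block resolves in far fewer forward passes than its length.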
## Performance Benchmarks (1x H100 80GB, bf16)

| Setup | Throughput | Latency (1024 tok) | Optimization |
|---|---|---|---|
| HuggingFace `generate()` (Q-mode) | 16 tok/s | ~64s | None |
| HuggingFace `generate()` (S-mode) | 33 tok/s | ~29s | None |
| dInfer + KV-cache (S-mode) | 198 tok/s | ~5.3s | Prefix KV-cache |
| dInfer + KV-cache + torch.compile* | ~400-600 tok/s | ~2s | + Compiled forward |
| dInfer + full stack* | ~1000-1500 tok/s | <1s | + CUDA graphs + FP8 |

*Projected based on component-level benchmarks. The KV-cache result is measured.
Key metric: TPF (tokens per forward) = 5.92 in S-mode, meaning ~6 tokens are committed per model forward pass via parallel denoising.
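The relationship between TPF and forward-pass count can be checked with quick arithmetic:

```python
# Forward passes needed for a 1024-token generation at the reported S-mode TPF.
gen_length = 1024
tpf = 5.92                      # reported tokens committed per forward pass
forwards = gen_length / tpf
print(round(forwards))          # ~173 forward passes, versus 1024 for autoregression
```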
## Inference Modes

### Standard (HuggingFace)

```bash
pip install transformers torch
python inference.py --prompt "Solve: x^2 - 5x + 6 = 0"
```

### Interactive Chat (Streamlit)

```bash
pip install streamlit transformers torch
streamlit run app.py
```

### Optimized Inference (dInfer, 198 tok/s measured)

For production deployment, use the dInfer inference engine with prefix KV-caching:

```bash
pip install dinfer vllm
python inference_dinfer.py --model shouryamaanjain/Noema-1
```

The KV-cache avoids redundant recomputation of finalized blocks: each denoising step processes only the current 32-token block instead of the entire context. See `inference_dinfer.py` for the full pipeline.
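A rough illustration of the per-step saving (the 2048-token context here is a made-up example, not a dInfer internal):

```python
# Tokens re-encoded per denoising step, with and without a prefix KV-cache.
context = 2048                   # hypothetical prompt + already-finalized tokens
block = 32                       # active block size
without_cache = context + block  # whole sequence re-encoded every step
with_cache = block               # only the current block is re-encoded
print(f"~{without_cache // with_cache}x less per-step compute")
```

The saving grows with context length, since the cached prefix no longer scales the per-step cost.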
## NF4 Quantized Inference (Consumer GPUs)

Noema-1 runs on GPUs with as little as 4 GB of VRAM using NF4 quantization:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_skip_modules=["gate", "word_embeddings", "lm_head", "norm"],
)

model = AutoModelForCausalLM.from_pretrained(
    "shouryamaanjain/Noema-1",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
```
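A back-of-envelope check on the 4 GB claim (ignoring activations, quantization metadata, and the modules kept in higher precision via `llm_int8_skip_modules`):

```python
# Rough weight footprint under NF4 quantization.
params = 2.86e9                 # total parameter count
gb_nf4 = params * 0.5 / 1e9     # NF4 stores ~4 bits (0.5 bytes) per weight
print(f"~{gb_nf4:.1f} GB of quantized weights")   # well under 4 GB
```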
## Training Details
- Training data: 600K curated math problems from OpenMathReasoning (CoT solutions from DeepSeek-R1 and QwQ-32B) and NuminaMath-CoT (GPT-4 Turbo solutions)
- Decontamination: 10-gram shingle overlap filtering against GSM8K, MATH-500, AIME 2024/2025, Omni-MATH, and OlympiadBench test sets
- Training objective: Masked Diffusion (M2T denoising) with variable noise ratio (0.3-0.8)
- Infrastructure: 8x H100 80GB, FSDP2 full fine-tuning
- Sequence length: 4096 tokens
- Optimizer: AdamW (beta1=0.9, beta2=0.999, lr=2e-6 with cosine decay)
- Training duration: ~80% of 1 epoch (30,000 optimization steps)
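The 10-gram shingle decontamination step can be sketched as follows. The exact tokenization and matching procedure used in training are not specified here, so this assumes simple whitespace tokens:

```python
def shingles(text, n=10):
    """All contiguous n-token spans (shingles) of a whitespace-tokenized text."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_example, test_shingles, n=10):
    """Flag a training example that shares any n-gram shingle with the test set."""
    return not shingles(train_example, n).isdisjoint(test_shingles)

# Hypothetical test-set problem and a training example that embeds it verbatim.
test_shingles = shingles("what is the sum of the first ten positive even integers")
print(is_contaminated(
    "solve this what is the sum of the first ten positive even integers now",
    test_shingles,
))
```

Any training example containing a verbatim 10-token span from a benchmark problem is dropped before training.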
## Limitations

- Specialized for mathematical reasoning: general conversation and coding capabilities are limited
- The masked diffusion generation paradigm requires `trust_remote_code=True` for the custom `generate()` method
- Optimized inference (1000+ tok/s) requires the dInfer framework; the standard HuggingFace `generate()` runs at ~30-50 tok/s
## Citation

If you use Noema-1 in your research, please cite:

```bibtex
@misc{noema1,
  title={Noema-1: A Compact MoE Masked Diffusion Model for Mathematical Reasoning},
  author={Shourya Maan Jain},
  year={2026},
  url={https://huggingface.co/shouryamaanjain/Noema-1}
}
```
## License

Apache 2.0