# Noema-1
Noema-1 is a 2.86B parameter Mixture-of-Experts (MoE) Masked Diffusion Language Model specialized in mathematical reasoning. Unlike autoregressive models that generate one token at a time, Noema-1 generates entire blocks of tokens in parallel through iterative denoising, achieving significantly higher throughput.
## Key Features

- 2.86B total parameters, ~720M active per token: efficient MoE routing activates only 4 of 32 experts per layer
- Parallel token generation: generates 32 tokens simultaneously per denoising step
- Block-wise iterative refinement: each block undergoes confidence-thresholded denoising with optional token editing
- Two inference modes:
  - Q-Mode (Quality): `threshold=0.7`, `editing_threshold=0.5` for higher accuracy with token editing
  - S-Mode (Speed): `threshold=0.5`, `editing_threshold=0.0` for faster generation, suitable for most tasks
- Strong mathematical reasoning: trained on 600K curated math problems with chain-of-thought solutions
## Architecture
| Component | Value |
|---|---|
| Total Parameters | 2.86B |
| Active Parameters/Token | ~720M |
| Hidden Size | 2048 |
| Layers | 20 (1 dense + 19 MoE) |
| Attention Heads | 16 (4 KV heads, GQA) |
| Experts per Layer | 32 |
| Active Experts per Token | 4 |
| Expert Groups | 8 (4 experts selected per group) |
| Shared Experts | 1 per MoE layer |
| Expert FFN Size | 512 |
| Vocabulary | 157,184 tokens |
| Max Context | 16,384 tokens |
| Precision | bfloat16 |
## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "shouryamaanjain/Noema-1",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(
    "shouryamaanjain/Noema-1",
    trust_remote_code=True,
)

prompt = "Prove that the square root of 2 is irrational."
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        inputs=input_ids,
        gen_length=1024,
        block_length=32,
        threshold=0.7,          # Q-Mode (use 0.5 for S-Mode)
        editing_threshold=0.5,  # Q-Mode (use 0.0 for S-Mode)
        max_post_steps=16,
        temperature=0.0,
        eos_early_stop=True,
    )

response = tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)
print(response)
```
## Generation Parameters

| Parameter | Description | Q-Mode | S-Mode | Range |
|---|---|---|---|---|
| `threshold` | Confidence threshold for accepting denoised tokens | 0.7 | 0.5 | 0.0 - 1.0 |
| `editing_threshold` | Confidence threshold for editing already-committed tokens | 0.5 | 0.0 | 0.0 - 1.0 |
| `block_length` | Number of tokens generated in parallel per block | 32 | 32 | 16 - 128 |
| `gen_length` | Maximum tokens to generate | 1024 | 1024 | 1 - 4096 |
| `max_post_steps` | Refinement iterations after all masks are resolved | 16 | 16 | 0 - 32 |
| `temperature` | Sampling temperature (0.0 = greedy) | 0.0 | 0.0 | 0.0 - 2.0 |
| `top_k` | Top-k sampling (None = disabled) | None | None | 1 - 1000 |
| `top_p` | Nucleus sampling (None = disabled) | None | None | 0.0 - 1.0 |
| `steps` | Max denoising steps per block | 32 | 32 | 1 - 64 |
| `eos_early_stop` | Stop at first EOS token | True | True | bool |
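The two documented modes differ only in the first two parameters, so they can be captured as presets and splatted into `generate()`. The dictionary names below are illustrative; the model exposes no such constants:

```python
# Parameter presets for the two documented inference modes
# (names are illustrative, not part of the model's API).
Q_MODE = {"threshold": 0.7, "editing_threshold": 0.5}  # quality: enables token editing
S_MODE = {"threshold": 0.5, "editing_threshold": 0.0}  # speed: no editing pass

# Usage sketch:
# model.generate(inputs=input_ids, gen_length=1024, block_length=32, **S_MODE)
print(S_MODE)
```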
## How Generation Works

Noema-1 uses block-wise masked diffusion, a fundamentally different approach from autoregressive generation:

- Initialization: The output space is filled with `[MASK]` tokens
- Block iteration: Blocks of `block_length` tokens are processed left-to-right
- Parallel denoising: Within each block, all masked tokens are predicted simultaneously
- Confidence thresholding: Only tokens with confidence above `threshold` are committed; the rest remain masked for the next iteration
- Token editing (Q-Mode): Already-committed tokens can be revised if the model's new prediction has higher confidence
- Early stopping: Generation halts when an EOS token is produced

This enables parallel token generation: multiple tokens are decoded per forward pass, unlike autoregressive models, which produce exactly one token per pass.
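The confidence-thresholded denoising loop above can be sketched with a toy stand-in for the model. This is a minimal illustration of the control flow for one block (no token editing, and `fake_model` is a placeholder, not the real denoiser):

```python
import random

MASK = "<mask>"

def fake_model(block):
    """Stand-in for the real denoiser: one (token, confidence) pair per position."""
    return [(f"tok{i}", random.random()) for i, _ in enumerate(block)]

def denoise_block(length=32, threshold=0.7, max_steps=64):
    """Commit predictions whose confidence clears `threshold`; re-mask the rest."""
    block = [MASK] * length
    steps = 0
    while MASK in block and steps < max_steps:
        preds = fake_model(block)          # all positions predicted in parallel
        for i, (tok, conf) in enumerate(preds):
            if block[i] == MASK and conf >= threshold:
                block[i] = tok             # commit high-confidence prediction
        steps += 1
    return block, steps

random.seed(0)
block, steps = denoise_block()
print(f"resolved {len(block)} tokens in {steps} forward passes")
```

Because several positions typically clear the threshold in the same step, the block resolves in far fewer forward passes than its length.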
## Performance Benchmarks (1x H100 80GB, bf16)

| Setup | Throughput | Latency (1024 tok) | Optimization |
|---|---|---|---|
| HuggingFace `generate()` (Q-mode) | 16 tok/s | ~64s | None |
| HuggingFace `generate()` (S-mode) | 33 tok/s | ~29s | None |
| dInfer + KV-cache (S-mode) | 198 tok/s | ~5.3s | Prefix KV-cache |
| dInfer + KV-cache + torch.compile* | ~400-600 tok/s | ~2s | + Compiled forward |
| dInfer + full stack* | ~1000-1500 tok/s | <1s | + CUDA graphs + FP8 |

*Projected based on component-level benchmarks. The KV-cache result is measured.
Key metric: TPF (tokens per forward) = 5.92 in S-mode, meaning ~6 tokens are committed per model forward pass via parallel denoising.
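The relationship between TPF and forward-pass count can be checked with quick arithmetic:

```python
# Forward passes needed for a 1024-token generation at the reported S-mode TPF.
gen_length = 1024
tpf = 5.92                      # reported tokens committed per forward pass
forwards = gen_length / tpf
print(round(forwards))          # ~173 forward passes, versus 1024 for autoregression
```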
## Inference Modes

### Standard (HuggingFace)

```bash
pip install transformers torch
python inference.py --prompt "Solve: x^2 - 5x + 6 = 0"
```

### Interactive Chat (Streamlit)

```bash
pip install streamlit transformers torch
streamlit run app.py
```

### Optimized Inference (dInfer, 198 tok/s measured)

For production deployment, use the dInfer inference engine with prefix KV-caching:

```bash
pip install dinfer vllm
python inference_dinfer.py --model shouryamaanjain/Noema-1
```

The KV-cache avoids redundant recomputation of finalized blocks: each denoising step processes only the current 32-token block instead of the entire context. See `inference_dinfer.py` for the full pipeline.
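A rough illustration of the per-step saving (the 2048-token context here is a made-up example, not a dInfer internal):

```python
# Tokens re-encoded per denoising step, with and without a prefix KV-cache.
context = 2048                   # hypothetical prompt + already-finalized tokens
block = 32                       # active block size
without_cache = context + block  # whole sequence re-encoded every step
with_cache = block               # only the current block is re-encoded
print(f"~{without_cache // with_cache}x less per-step compute")
```

The saving grows with context length, since the cached prefix no longer scales the per-step cost.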
## NF4 Quantized Inference (Consumer GPUs)

Noema-1 runs on GPUs with as little as 4 GB of VRAM using NF4 quantization:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_skip_modules=["gate", "word_embeddings", "lm_head", "norm"],
)

model = AutoModelForCausalLM.from_pretrained(
    "shouryamaanjain/Noema-1",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
```
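A back-of-envelope check on the 4 GB claim (ignoring activations, quantization metadata, and the modules kept in higher precision via `llm_int8_skip_modules`):

```python
# Rough weight footprint under NF4 quantization.
params = 2.86e9                 # total parameter count
gb_nf4 = params * 0.5 / 1e9     # NF4 stores ~4 bits (0.5 bytes) per weight
print(f"~{gb_nf4:.1f} GB of quantized weights")   # well under 4 GB
```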
## Training Details
- Training data: 600K curated math problems from OpenMathReasoning (CoT solutions from DeepSeek-R1 and QwQ-32B) and NuminaMath-CoT (GPT-4 Turbo solutions)
- Decontamination: 10-gram shingle overlap filtering against GSM8K, MATH-500, AIME 2024/2025, Omni-MATH, and OlympiadBench test sets
- Training objective: Masked Diffusion (M2T denoising) with variable noise ratio (0.3-0.8)
- Infrastructure: 8x H100 80GB, FSDP2 full fine-tuning
- Sequence length: 4096 tokens
- Optimizer: AdamW (beta1=0.9, beta2=0.999, lr=2e-6 with cosine decay)
- Training duration: ~80% of 1 epoch (30,000 optimization steps)
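The 10-gram shingle decontamination step can be sketched as follows. The exact tokenization and matching procedure used in training are not specified here, so this assumes simple whitespace tokens:

```python
def shingles(text, n=10):
    """All contiguous n-token spans (shingles) of a whitespace-tokenized text."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_example, test_shingles, n=10):
    """Flag a training example that shares any n-gram shingle with the test set."""
    return not shingles(train_example, n).isdisjoint(test_shingles)

# Hypothetical test-set problem and a training example that embeds it verbatim.
test_shingles = shingles("what is the sum of the first ten positive even integers")
print(is_contaminated(
    "solve this what is the sum of the first ten positive even integers now",
    test_shingles,
))
```

Any training example containing a verbatim 10-token span from a benchmark problem is dropped before training.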
## Limitations

- Specialized for mathematical reasoning: general conversation and coding capabilities are limited
- The masked diffusion generation paradigm requires `trust_remote_code=True` for the custom `generate()` method
- Optimized inference (1000+ tok/s) requires the dInfer framework; the standard HuggingFace `generate()` runs at ~30-50 tok/s
## Citation

If you use Noema-1 in your research, please cite:

```bibtex
@misc{noema1,
  title={Noema-1: A Compact MoE Masked Diffusion Model for Mathematical Reasoning},
  author={Shourya Maan Jain},
  year={2026},
  url={https://huggingface.co/shouryamaanjain/Noema-1}
}
```
## License

Apache 2.0