Noema-1

Noema-1 is a 2.86B parameter Mixture-of-Experts (MoE) Masked Diffusion Language Model specialized in mathematical reasoning. Unlike autoregressive models that generate one token at a time, Noema-1 generates entire blocks of tokens in parallel through iterative denoising, achieving significantly higher throughput.

Key Features

  • 2.86B total parameters, ~720M active per token - efficient MoE routing activates only 4 of 32 experts per layer
  • Parallel token generation - generates 32 tokens simultaneously per denoising step
  • Block-wise iterative refinement - each block undergoes confidence-thresholded denoising with optional token editing
  • Two inference modes:
    • Q-Mode (Quality): threshold=0.7, editing_threshold=0.5 - higher accuracy with token editing
    • S-Mode (Speed): threshold=0.5, editing_threshold=0.0 - faster generation, suitable for most tasks
  • Strong mathematical reasoning - trained on 600K curated math problems with chain-of-thought solutions

Architecture

| Component | Value |
|---|---|
| Total Parameters | 2.86B |
| Active Parameters/Token | ~720M |
| Hidden Size | 2048 |
| Layers | 20 (1 dense + 19 MoE) |
| Attention Heads | 16 (4 KV heads, GQA) |
| Experts per Layer | 32 |
| Active Experts per Token | 4 |
| Expert Groups | 8 (4 experts selected per group) |
| Shared Experts | 1 per MoE layer |
| Expert FFN Size | 512 |
| Vocabulary | 157,184 tokens |
| Max Context | 16,384 tokens |
| Precision | bfloat16 |
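
The 4-of-32 expert selection above can be sketched as standard top-k softmax gating. This is an illustrative, dependency-free sketch, not Noema-1's actual router (which also involves expert groups and a shared expert):

```python
import math

def route_topk(router_logits, k=4):
    """Pick the k highest-scoring experts for one token, then renormalize
    their gate weights with a softmax over only the selected logits."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)          # for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

# Example: one token's router logits over 32 experts
indices, weights = route_topk([0.1 * i for i in range(32)])
```

Only the 4 selected experts run their FFNs for that token, which is why active parameters (~720M) are far below total parameters (2.86B).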

Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "shouryamaanjain/Noema-1",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(
    "shouryamaanjain/Noema-1",
    trust_remote_code=True,
)

prompt = "Prove that the square root of 2 is irrational."
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output = model.generate(
        inputs=input_ids,
        gen_length=1024,
        block_length=32,
        threshold=0.7,          # Q-Mode (use 0.5 for S-Mode)
        editing_threshold=0.5,  # Q-Mode (use 0.0 for S-Mode)
        max_post_steps=16,
        temperature=0.0,
        eos_early_stop=True,
    )

response = tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)
print(response)

Generation Parameters

| Parameter | Description | Q-Mode | S-Mode | Range |
|---|---|---|---|---|
| threshold | Confidence threshold for accepting denoised tokens | 0.7 | 0.5 | 0.0-1.0 |
| editing_threshold | Confidence threshold for editing already-committed tokens | 0.5 | 0.0 | 0.0-1.0 |
| block_length | Number of tokens generated in parallel per block | 32 | 32 | 16-128 |
| gen_length | Maximum tokens to generate | 1024 | 1024 | 1-4096 |
| max_post_steps | Refinement iterations after all masks are resolved | 16 | 16 | 0-32 |
| temperature | Sampling temperature (0.0 = greedy) | 0.0 | 0.0 | 0.0-2.0 |
| top_k | Top-k sampling (None = disabled) | None | None | 1-1000 |
| top_p | Nucleus sampling (None = disabled) | None | None | 0.0-1.0 |
| steps | Max denoising steps per block | 32 | 32 | 1-64 |
| eos_early_stop | Stop at first EOS token | True | True | bool |
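
The Q-Mode and S-Mode columns can be bundled into presets. `MODE_PRESETS` and `generation_kwargs` are hypothetical helper names; the keyword arguments themselves match the table above:

```python
# Hypothetical convenience helper; only the kwarg names/values come from the table.
MODE_PRESETS = {
    "q": {"threshold": 0.7, "editing_threshold": 0.5},  # quality: token editing on
    "s": {"threshold": 0.5, "editing_threshold": 0.0},  # speed: token editing off
}

def generation_kwargs(mode="s", **overrides):
    kwargs = {"gen_length": 1024, "block_length": 32, "max_post_steps": 16,
              "temperature": 0.0, "eos_early_stop": True}
    kwargs.update(MODE_PRESETS[mode])
    kwargs.update(overrides)
    return kwargs

qmode = generation_kwargs("q")
smode_short = generation_kwargs("s", gen_length=256)
```

These dicts can then be splatted into the custom generate() call, e.g. `model.generate(inputs=input_ids, **qmode)`.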

How Generation Works

Noema-1 uses block-wise masked diffusion, a fundamentally different approach from autoregressive generation:

  1. Initialization: All positions in the generation window are filled with [MASK] tokens
  2. Block iteration: Blocks of block_length tokens are processed left-to-right
  3. Parallel denoising: Within each block, all masked tokens are predicted simultaneously
  4. Confidence thresholding: Only tokens with confidence above threshold are committed; the rest remain masked for the next iteration
  5. Token editing (Q-Mode): Already-committed tokens can be revised if the model's new prediction has higher confidence
  6. Early stopping: Generation halts when an EOS token is produced

This enables parallel token generation: multiple tokens are decoded per forward pass, unlike autoregressive models, which produce exactly one token per pass.
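
Steps 1-4 above can be simulated with a toy predictor. Everything here (`denoise_block`, `mock_predict`, and its confidence scores) is illustrative, not model code:

```python
import random

MASK = "[MASK]"

def denoise_block(predict, block_len=8, threshold=0.7, max_steps=32):
    """Toy version of confidence-thresholded denoising: predict every masked
    position in parallel, commit only predictions above the threshold, repeat."""
    block = [MASK] * block_len
    for step in range(1, max_steps + 1):
        masked = [i for i, t in enumerate(block) if t == MASK]
        if not masked:
            return block, step - 1          # all positions committed
        for i in masked:                     # "parallel" prediction, simulated
            token, conf = predict(i)
            if conf >= threshold:            # commit only confident predictions
                block[i] = token
    return block, max_steps

random.seed(0)
def mock_predict(i):
    # Hypothetical stand-in for the model: a token plus a random confidence.
    return f"tok{i}", random.random()

block, steps_used = denoise_block(mock_predict)
```

Low-confidence positions simply stay masked and get another chance on the next step, which is what makes the per-block step count variable.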

Performance Benchmarks (1x H100 80GB, bf16)

| Setup | Throughput | Latency (1024 tok) | Optimization |
|---|---|---|---|
| HuggingFace generate() (Q-mode) | 16 tok/s | ~64s | None |
| HuggingFace generate() (S-mode) | 33 tok/s | ~29s | None |
| dInfer + KV-cache (S-mode) | 198 tok/s | ~5.3s | Prefix KV-cache |
| dInfer + KV-cache + torch.compile* | ~400-600 tok/s | ~2s | + Compiled forward |
| dInfer + full stack* | ~1000-1500 tok/s | <1s | + CUDA graphs + FP8 |

*Projected based on component-level benchmarks. KV-cache result is measured.

Key metric: TPF (tokens per forward) = 5.92 in S-mode, meaning ~6 tokens are committed per model forward pass via parallel denoising.
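
Back-of-envelope, this is what TPF implies for forward-pass counts (`forward_passes` is an illustrative helper, not part of the model API):

```python
# At TPF = 5.92, a 1024-token completion needs roughly 1024 / 5.92, i.e. about
# 173 forward passes, versus 1024 for a strictly autoregressive decoder.
def forward_passes(gen_length, tokens_per_forward):
    return round(gen_length / tokens_per_forward)

diffusion_passes = forward_passes(1024, 5.92)
autoregressive_passes = forward_passes(1024, 1.0)
```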

Inference Modes

Standard (HuggingFace)

pip install transformers torch
python inference.py --prompt "Solve: x^2 - 5x + 6 = 0"

Interactive Chat (Streamlit)

pip install streamlit transformers torch
streamlit run app.py

Optimized Inference (dInfer): 198 tok/s measured

For production deployment, use the dInfer inference engine with prefix KV-caching:

pip install dinfer vllm
python inference_dinfer.py --model shouryamaanjain/Noema-1

The KV-cache avoids redundant recomputation of finalized blocks: each denoising step processes only the current 32-token block instead of the entire context. See inference_dinfer.py for the full pipeline.
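
A rough way to see the savings: count how many token positions pass through the model with and without the prefix cache. The 6-steps-per-block figure below is an assumption derived from TPF ~= 5.92 (about 32/6 tokens committed per step), and `tokens_processed` is an illustrative helper, not dInfer code:

```python
def tokens_processed(prompt_len, gen_length, block_length, steps_per_block, cached):
    """Count token positions run through the model across all denoising steps.
    With a prefix KV-cache each step re-encodes only the current block;
    without it, each step re-encodes the whole context so far."""
    total, done = 0, prompt_len
    for _ in range(gen_length // block_length):
        per_step = block_length if cached else done + block_length
        total += per_step * steps_per_block
        done += block_length
    return total

# Illustrative numbers: 256-token prompt, 1024 generated, 32-token blocks,
# assumed ~6 denoising steps per block.
with_cache = tokens_processed(256, 1024, 32, 6, cached=True)
without_cache = tokens_processed(256, 1024, 32, 6, cached=False)
```

Under these assumptions the cached run pushes roughly 25x fewer token positions through the model, which is consistent in spirit with the measured 33 to 198 tok/s jump (attention over the cached prefix still costs something, so the speedup is smaller than the raw ratio).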

NF4 Quantized Inference (Consumer GPUs)

Noema-1 runs on GPUs with as little as 4 GB VRAM using NF4 quantization:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_skip_modules=["gate", "word_embeddings", "lm_head", "norm"],
)

model = AutoModelForCausalLM.from_pretrained(
    "shouryamaanjain/Noema-1",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
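
A rough weight-memory estimate for this configuration. The ~0.35B-parameter bf16 split for the skipped modules (router gates, embeddings, lm_head, norms) is a guess, and double-quantization overhead plus activation memory are ignored:

```python
def weight_memory_gb(total_params, skipped_params, quant_bits=4, skipped_bits=16):
    """Approximate weight footprint: quantized params at quant_bits,
    skipped modules kept at skipped_bits (bf16)."""
    quantized = (total_params - skipped_params) * quant_bits / 8  # bytes
    kept = skipped_params * skipped_bits / 8                      # bytes
    return (quantized + kept) / 1e9

# Hypothetical split: ~0.35B parameters stay in bf16
approx_gb = weight_memory_gb(2.86e9, 0.35e9)
```

Under these assumptions the weights come to roughly 2 GB, which is why the model fits on 4 GB consumer GPUs with room left for activations and cache.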

Training Details

  • Training data: 600K curated math problems from OpenMathReasoning (CoT solutions from DeepSeek-R1 and QwQ-32B) and NuminaMath-CoT (GPT-4 Turbo solutions)
  • Decontamination: 10-gram shingle overlap filtering against GSM8K, MATH-500, AIME 2024/2025, Omni-MATH, and OlympiadBench test sets
  • Training objective: Masked Diffusion (M2T denoising) with variable noise ratio (0.3-0.8)
  • Infrastructure: 8x H100 80GB, FSDP2 full fine-tuning
  • Sequence length: 4096 tokens
  • Optimizer: AdamW (beta1=0.9, beta2=0.999, lr=2e-6 with cosine decay)
  • Training duration: ~80% of 1 epoch (30,000 optimization steps)
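
The 10-gram shingle filtering can be sketched as follows; the lowercase/whitespace normalization here is an assumption, not necessarily the exact pipeline used:

```python
def shingles(text, n=10):
    """Word-level n-gram shingles used for overlap-based decontamination."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_example, test_example, n=10):
    """Flag a training example that shares any n-gram shingle with a test item."""
    return bool(shingles(train_example, n) & shingles(test_example, n))

# Toy strings: "leaked" embeds a 10-word run from the test question verbatim.
test_q = "what is the sum of the first ten positive even integers in total"
clean = "find the product of the roots of x^2 - 5x + 6 = 0"
leaked = "compute what is the sum of the first ten positive even integers quickly"
```

Any training example sharing even one 10-gram with a benchmark item gets dropped, which catches verbatim and near-verbatim leakage while leaving merely similar problems alone.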

Limitations

  • Specialized for mathematical reasoning - general conversation and coding capabilities are limited
  • The masked diffusion generation paradigm requires trust_remote_code=True for its custom generate() method
  • Optimized inference (1000+ tok/s) requires the dInfer framework; the standard HuggingFace generate() runs at ~30-50 tok/s

Citation

If you use Noema-1 in your research, please cite:

@misc{noema1,
    title={Noema-1: A Compact MoE Masked Diffusion Model for Mathematical Reasoning},
    author={Shourya Maan Jain},
    year={2026},
    url={https://huggingface.co/shouryamaanjain/Noema-1}
}

License

Apache 2.0
