Nemotron-3-Nano-30B-A3B GLQ 3.5-bit Mixed Precision

NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 quantized to an average of 3.5 bits per weight using GLQ with mixed-precision per-layer bit allocation.

This is a hybrid Mamba-Attention-MoE architecture (30B total, ~3B active parameters).

Usage

pip install "glq>=0.2.6" mamba-ssm causal-conv1d

import glq.hf_integration
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/Nemotron-3-Nano-30B-A3B-GLQ-3.5bpw",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "xv0y5ncu/Nemotron-3-Nano-30B-A3B-GLQ-3.5bpw",
    trust_remote_code=True,
)

inputs = tokenizer("The theory of relativity states that", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Quantization Details

  • Method: GLQ (E8 lattice codebook + RHT + LDLQ error feedback)
  • Average bpw: 3.5 (mixed: 75% at 4bpw, 25% at 2bpw)
  • Calibration: 128 samples from WikiText-2, sequence length 2048
  • Quantized sublayers: 6,004 (128 MoE experts per layer + attention + MLP)
  • Model size: 23 GB (vs ~60 GB at bf16)
  • Quantization time: 40 minutes on L40S with --streaming
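The 3.5 bpw average follows directly from the 75%/25% bit mix; a quick sanity check (the helper below is illustrative, not part of the glq package):

```python
def average_bpw(fractions_to_bits: dict[float, float]) -> float:
    """Weighted average of bits per weight over layer fractions."""
    return sum(frac * bits for frac, bits in fractions_to_bits.items())

# 75% of sublayers at 4 bpw, 25% at 2 bpw -> 3.5 bpw average
avg = average_bpw({0.75: 4.0, 0.25: 2.0})
print(avg)  # 3.5
```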

Sensitivity profiling assigns 4bpw to attention K/V projections and critical MoE experts, and 2bpw to layers that are robust to quantization.
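The allocation scheme above can be sketched as follows. This is a hypothetical illustration, not the actual glq API: sublayers are ranked by a sensitivity score, the most sensitive 75% get 4 bits, and the rest get 2 bits.

```python
def allocate_bits(sensitivity: dict[str, float], frac_4bit: float = 0.75) -> dict[str, int]:
    """Rank sublayers by sensitivity; top frac_4bit get 4bpw, the rest 2bpw."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    cutoff = round(len(ranked) * frac_4bit)
    return {name: (4 if i < cutoff else 2) for i, name in enumerate(ranked)}

# Illustrative sublayer names and scores, not profiled values.
bits = allocate_bits({
    "attn.k_proj": 9.1,   # sensitive -> 4bpw
    "attn.v_proj": 8.7,   # sensitive -> 4bpw
    "moe.expert_3": 7.5,  # sensitive -> 4bpw
    "mlp.up_proj": 1.2,   # robust    -> 2bpw
})
```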

Requirements

  • glq>=0.2.6
  • mamba-ssm and causal-conv1d (for Mamba layers)
  • trust_remote_code=True (custom architecture)
  • CUDA GPU

License

NVIDIA Open Model License (same as base model). See license.
