# Nemotron-3-Nano-30B-A3B GLQ 3.5-bit Mixed Precision

NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16 quantized to an average of 3.5 bits per weight using GLQ with mixed-precision per-layer bit allocation.
This is a hybrid Mamba-Attention-MoE architecture (30B total, ~3B active parameters).
## Usage

```shell
pip install "glq>=0.2.6" mamba-ssm causal-conv1d
```

```python
import glq.hf_integration  # must be imported so transformers can load GLQ checkpoints
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/Nemotron-3-Nano-30B-A3B-GLQ-3.5bpw",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "xv0y5ncu/Nemotron-3-Nano-30B-A3B-GLQ-3.5bpw",
    trust_remote_code=True,
)

inputs = tokenizer("The theory of relativity states that", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Quantization Details

- Method: GLQ (E8 lattice codebook + RHT + LDLQ error feedback)
- Average bpw: 3.5 (mixed: 75% at 4 bpw, 25% at 2 bpw)
- Calibration: 128 samples from WikiText-2, sequence length 2048
- Quantized sublayers: 6,004 (128 MoE experts per layer + attention + MLP)
- Model size: 23 GB (vs. ~60 GB at bf16)
- Quantization time: 40 minutes on an L40S with `--streaming`
Sensitivity profiling assigns 4 bpw to attention K/V projections and critical MoE experts, and 2 bpw to robust layers.
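The 3.5-bit average follows directly from the 75/25 split: 0.75 × 4 + 0.25 × 2 = 3.5 bits per weight. A minimal sketch of this kind of rank-based allocation, with illustrative names only (this is not the actual GLQ profiling API):

```python
# Hypothetical sketch: rank sublayers by a sensitivity score and assign the
# most sensitive 75% of them 4 bits, the remaining 25% 2 bits.
def allocate_bits(sensitivity, frac_4bit=0.75):
    """Map each sublayer name to 4 or 2 bits by sensitivity rank."""
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    cutoff = int(round(len(ranked) * frac_4bit))
    return {name: (4 if i < cutoff else 2) for i, name in enumerate(ranked)}

# Toy scores: attention projections and one "critical" expert rank highest.
scores = {"attn.k_proj": 0.9, "attn.v_proj": 0.8, "expert.7": 0.6, "expert.3": 0.1}
bits = allocate_bits(scores)
print(bits)                                   # sensitive sublayers get 4 bits, the rest 2
print(sum(bits.values()) / len(bits))         # 3.5 average bits per weight at a 75/25 split
```

In the real pipeline the sensitivity scores come from calibration-time profiling rather than hand-set values; the averaging arithmetic is the same.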
## Requirements

- `glq>=0.2.6`
- `mamba-ssm` and `causal-conv1d` (for Mamba layers)
- `trust_remote_code=True` (custom architecture)
- CUDA GPU
## License

NVIDIA Open Model License (same as the base model). See the license for details.