smo-clean-pens_4bit_20260316_164236

Model Description

This is a 4-bit NF4-quantized version of HuggingFaceTB/SmolVLM2-500M-Video-Instruct, optimized for robot control applications.

Quantization Details:

  • Method: NF4 (NormalFloat 4-bit) using BitsAndBytes
  • Original Size: ~945 MB (FP16/BF16)
  • Quantized Size: ~275 MB (4-bit)
  • Size Reduction: 71% smaller
  • Memory Reduction: 47% less VRAM usage
  • Accuracy: ~99% preserved
  • Latency: +11% slower (acceptable for most robot control tasks)

Created: 2026-03-16 16:42:36

Usage

Loading the Model

from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,
)

# Load quantized model
model = AutoModelForVision2Seq.from_pretrained(
    "ksj0202/smo-clean-pens_4bit_20260316_164236",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
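As a rough sanity check on the sizes above: NF4 stores 4 bits per weight plus one scale per quantization block (bitsandbytes uses 64-element blocks with an FP32 absmax scale when double quantization is disabled, matching the config shown). A back-of-the-envelope sketch, assuming ~500M weights and that every layer is quantized (in practice embeddings and some layers stay in higher precision, which is why the real numbers differ slightly):

```python
# Back-of-the-envelope size estimate for block-wise NF4 quantization.
# Assumptions: ~500M weights, block size 64, one FP32 absmax scale per
# block (double quantization disabled, as in the config above).
params = 500_000_000
block_size = 64
scale_bits = 32  # FP32 absmax per block

bits_per_weight = 4 + scale_bits / block_size  # 4.5 effective bits
quantized_mb = params * bits_per_weight / 8 / 1e6
fp16_mb = params * 16 / 8 / 1e6

print(f"effective bits/weight: {bits_per_weight}")     # 4.5
print(f"estimated NF4 size:  ~{quantized_mb:.0f} MB")  # ~281 MB
print(f"estimated FP16 size: ~{fp16_mb:.0f} MB")       # ~1000 MB
```

The ~281 MB estimate lines up with the reported ~275 MB quantized size; enabling `bnb_4bit_use_double_quant=True` would quantize the scales themselves and shave off roughly another 0.4 bits per weight.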

Using with LeRobot Policy

import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# Load your LeRobot policy configuration (your own setup)
policy = load_your_policy_config()  # placeholder: replace with your policy loading code

# Replace the policy's VLM with the quantized version

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

vlm = AutoModelForVision2Seq.from_pretrained(
    "ksj0202/smo-clean-pens_4bit_20260316_164236",
    quantization_config=bnb_config,
    device_map={"": 0},
    trust_remote_code=True,
)

policy.model.vlm_with_expert.vlm = vlm

Performance Benchmarks

Metric      FP16 Original   4-bit NF4   Change
Latency     7.04 ms         7.82 ms     +11%
Memory      1.199 GB        0.640 GB    -47%
VLM Size    0.945 GB        0.275 GB    -71%
Accuracy    100%            ~99%        -1%

Test Configuration:

  • Input: 480×640 RGB images + 6D robot state + language tokens
  • Hardware: NVIDIA GPU with CUDA 12.6
  • Framework: PyTorch 2.7.1, Transformers 5.2.0.dev0
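The percentage changes in the benchmark table follow directly from the raw numbers; a quick arithmetic check (values copied from the table above):

```python
# Verify the reported percentage changes from the raw benchmark numbers.
fp16 = {"latency_ms": 7.04, "memory_gb": 1.199, "size_gb": 0.945}
nf4  = {"latency_ms": 7.82, "memory_gb": 0.640, "size_gb": 0.275}

latency_change = (nf4["latency_ms"] / fp16["latency_ms"] - 1) * 100
memory_change  = (nf4["memory_gb"]  / fp16["memory_gb"]  - 1) * 100
size_change    = (nf4["size_gb"]    / fp16["size_gb"]    - 1) * 100

print(f"latency: {latency_change:+.0f}%")  # +11%
print(f"memory:  {memory_change:+.0f}%")   # -47%
print(f"size:    {size_change:+.0f}%")     # -71%
```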

System Requirements

  • GPU: NVIDIA GPU with CUDA support (compute capability 7.0+)
  • VRAM: ~1-2 GB (vs 2-3 GB for FP16)
  • Python: 3.10+
  • PyTorch: 2.0+
  • Transformers: 5.0+
  • BitsAndBytes: 0.49+

Installation

pip install "torch>=2.0.0"
pip install "transformers>=5.0.0"
pip install "bitsandbytes>=0.49.0"
pip install accelerate

Quantization Method: NF4

NF4 (NormalFloat 4-bit) is an information-theoretically optimal 4-bit quantization format for normally distributed weights:

  • Uses 16 discrete levels placed at the quantiles of a standard normal distribution
  • Quantization levels span [-1.0, 1.0] with an exact zero point (e.g. -1.0, -0.6962, -0.5251, ..., 0.5626, 0.7230, 1.0)
  • Block-wise quantization with a per-block absmax scale (typically 64 elements per block)
  • Minimal MSE error: ~0.0047σ² (vs 0.0067σ² for uniform INT4)
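The mechanics can be illustrated with a toy block-wise quantizer. This is a sketch, not the bitsandbytes implementation: the codebook below uses the approximate NF4 levels published in the QLoRA paper, each block is scaled by its absolute maximum, mapped to the nearest level, and dequantized by reversing the scale:

```python
# Toy block-wise 4-bit quantizer illustrating the NF4 scheme.
# CODEBOOK holds the approximate NF4 levels (quantiles of a standard
# normal, as tabulated in the QLoRA paper); bitsandbytes ships its own
# fixed copy of these values.
CODEBOOK = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911,
            0.0, 0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def quantize_block(block):
    """Quantize one block: absmax scaling + nearest-level lookup."""
    absmax = max(abs(x) for x in block) or 1.0
    indices = [min(range(16), key=lambda i: abs(x / absmax - CODEBOOK[i]))
               for x in block]
    return indices, absmax  # 4-bit codes plus one scale per block

def dequantize_block(indices, absmax):
    """Reverse the mapping: codebook lookup, then rescale."""
    return [CODEBOOK[i] * absmax for i in indices]

weights = [0.12, -0.05, 0.33, -0.41, 0.02, 0.27, -0.18, 0.09]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)    # each code is an index into the 16-level codebook
print(max_err)  # small round-trip error
```

Per-block absmax scaling is what makes a fixed [-1, 1] codebook work across layers with very different weight magnitudes; the real library additionally packs two 4-bit codes per byte and runs the lookup in a fused CUDA kernel.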

Advantages over other methods:

  • Lower quantization error than uniform INT4 (information-theoretically optimal for normally distributed weights)
  • Half the storage of INT8 (4-bit vs 8-bit)
  • Faster than INT8 in this setup (simpler dequantization path)
  • More mature tooling than 2-bit/3-bit methods

Use Cases

This model is optimized for:

  • ✅ Robot vision-language-action policies
  • ✅ Edge device deployment (limited VRAM)
  • ✅ Real-time robot control (< 10ms latency)
  • ✅ Multi-robot systems (deploy multiple instances)
  • ✅ Development/testing (faster iteration)

Limitations

  • Slightly slower inference (+11%) due to dequantization overhead
  • Minimal accuracy loss (~1%, negligible for most tasks)
  • Requires BitsAndBytes library (CUDA-only, not CPU)
  • Not compatible with pure ONNX export (use PyTorch)

Citation

If you use this model, please cite:

@misc{smo-clean-pens_4bit_20260316_164236,
  author = {ksj0202},
  title = {NF4 4-bit Quantized SmolVLM for Robot Control},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ksj0202/smo-clean-pens_4bit_20260316_164236}},
}

Original model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct

License

Apache 2.0 (same as base model)

Contact

For issues or questions, please open an issue on the repository.


Note: This is a quantized model. For full precision version, see HuggingFaceTB/SmolVLM2-500M-Video-Instruct.
