smo-clean-pens_4bit_20260316_164236

Model Description

This is a 4-bit NF4-quantized version of HuggingFaceTB/SmolVLM2-500M-Video-Instruct, optimized for robot control applications.

Quantization Details:

  • Method: NF4 (NormalFloat 4-bit) using BitsAndBytes
  • Original Size: ~945 MB (FP16/BF16)
  • Quantized Size: ~275 MB (4-bit)
  • Size Reduction: 71% smaller
  • Memory Reduction: 47% less VRAM usage
  • Accuracy: ~99% preserved
  • Latency: +11% slower (acceptable for most robot control tasks)

Created: 2026-03-16 16:42:36

Usage

Loading the Model

from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,
)

# Load quantized model
model = AutoModelForVision2Seq.from_pretrained(
    "ksj0202/smo-clean-pens_4bit_20260316_164236",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
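As a rough sanity check on the sizes above: NF4 stores 4 bits per weight plus one scale per quantization block (bitsandbytes uses 64-element blocks with an FP32 absmax scale when double quantization is disabled, matching the config shown). A back-of-the-envelope sketch, assuming ~500M weights and that every layer is quantized (in practice embeddings and some layers stay in higher precision, which is why the real numbers differ slightly):

```python
# Back-of-the-envelope size estimate for block-wise NF4 quantization.
# Assumptions: ~500M weights, block size 64, one FP32 absmax scale per
# block (double quantization disabled, as in the config above).
params = 500_000_000
block_size = 64
scale_bits = 32  # FP32 absmax per block

bits_per_weight = 4 + scale_bits / block_size  # 4.5 effective bits
quantized_mb = params * bits_per_weight / 8 / 1e6
fp16_mb = params * 16 / 8 / 1e6

print(f"effective bits/weight: {bits_per_weight}")     # 4.5
print(f"estimated NF4 size:  ~{quantized_mb:.0f} MB")  # ~281 MB
print(f"estimated FP16 size: ~{fp16_mb:.0f} MB")       # ~1000 MB
```

The ~281 MB estimate lines up with the reported ~275 MB quantized size; enabling `bnb_4bit_use_double_quant=True` would quantize the scales themselves and shave off roughly another 0.4 bits per weight.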

Using with LeRobot Policy

import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# Load your LeRobot policy configuration (your own setup)
policy = load_your_policy_config()  # placeholder: replace with your policy loading code

# Replace the policy's VLM with the quantized version

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

vlm = AutoModelForVision2Seq.from_pretrained(
    "ksj0202/smo-clean-pens_4bit_20260316_164236",
    quantization_config=bnb_config,
    device_map={"": 0},
    trust_remote_code=True,
)

policy.model.vlm_with_expert.vlm = vlm

Performance Benchmarks

Metric      FP16 Original   4-bit NF4   Change
Latency     7.04 ms         7.82 ms     +11%
Memory      1.199 GB        0.640 GB    -47%
VLM Size    0.945 GB        0.275 GB    -71%
Accuracy    100%            ~99%        -1%

Test Configuration:

  • Input: 480×640 RGB images + 6D robot state + language tokens
  • Hardware: NVIDIA GPU with CUDA 12.6
  • Framework: PyTorch 2.7.1, Transformers 5.2.0.dev0
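The percentage changes in the benchmark table follow directly from the raw numbers; a quick arithmetic check (values copied from the table above):

```python
# Verify the reported percentage changes from the raw benchmark numbers.
fp16 = {"latency_ms": 7.04, "memory_gb": 1.199, "size_gb": 0.945}
nf4  = {"latency_ms": 7.82, "memory_gb": 0.640, "size_gb": 0.275}

latency_change = (nf4["latency_ms"] / fp16["latency_ms"] - 1) * 100
memory_change  = (nf4["memory_gb"]  / fp16["memory_gb"]  - 1) * 100
size_change    = (nf4["size_gb"]    / fp16["size_gb"]    - 1) * 100

print(f"latency: {latency_change:+.0f}%")  # +11%
print(f"memory:  {memory_change:+.0f}%")   # -47%
print(f"size:    {size_change:+.0f}%")     # -71%
```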

System Requirements

  • GPU: NVIDIA GPU with CUDA support (compute capability 7.0+)
  • VRAM: ~1-2 GB (vs 2-3 GB for FP16)
  • Python: 3.10+
  • PyTorch: 2.0+
  • Transformers: 5.0+
  • BitsAndBytes: 0.49+

Installation

pip install "torch>=2.0.0"
pip install "transformers>=5.0.0"
pip install "bitsandbytes>=0.49.0"
pip install accelerate

Quantization Method: NF4

NF4 (NormalFloat 4-bit) is an information-theoretically optimal 4-bit quantization format for normally distributed weights:

  • Uses 16 discrete levels placed at the quantiles of a standard normal distribution
  • Quantization levels span [-1.0, 1.0] with an exact zero point (e.g. -1.0, -0.6962, -0.5251, ..., 0.5626, 0.7230, 1.0)
  • Block-wise quantization with a per-block absmax scale (typically 64 elements per block)
  • Minimal MSE error: ~0.0047σ² (vs 0.0067σ² for uniform INT4)
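The mechanics can be illustrated with a toy block-wise quantizer. This is a sketch, not the bitsandbytes implementation: the codebook below uses the approximate NF4 levels published in the QLoRA paper, each block is scaled by its absolute maximum, mapped to the nearest level, and dequantized by reversing the scale:

```python
# Toy block-wise 4-bit quantizer illustrating the NF4 scheme.
# CODEBOOK holds the approximate NF4 levels (quantiles of a standard
# normal, as tabulated in the QLoRA paper); bitsandbytes ships its own
# fixed copy of these values.
CODEBOOK = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911,
            0.0, 0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0]

def quantize_block(block):
    """Quantize one block: absmax scaling + nearest-level lookup."""
    absmax = max(abs(x) for x in block) or 1.0
    indices = [min(range(16), key=lambda i: abs(x / absmax - CODEBOOK[i]))
               for x in block]
    return indices, absmax  # 4-bit codes plus one scale per block

def dequantize_block(indices, absmax):
    """Reverse the mapping: codebook lookup, then rescale."""
    return [CODEBOOK[i] * absmax for i in indices]

weights = [0.12, -0.05, 0.33, -0.41, 0.02, 0.27, -0.18, 0.09]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)    # each code is an index into the 16-level codebook
print(max_err)  # small round-trip error
```

Per-block absmax scaling is what makes a fixed [-1, 1] codebook work across layers with very different weight magnitudes; the real library additionally packs two 4-bit codes per byte and runs the lookup in a fused CUDA kernel.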

Advantages over other methods:

  • Lower quantization error than uniform INT4 (information-theoretically optimal for normally distributed weights)
  • Half the storage of INT8 (4-bit vs 8-bit)
  • Faster than INT8 in this setup (simpler dequantization path)
  • More mature tooling than 2-bit/3-bit methods

Use Cases

This model is optimized for:

  • ✅ Robot vision-language-action policies
  • ✅ Edge device deployment (limited VRAM)
  • ✅ Real-time robot control (< 10ms latency)
  • ✅ Multi-robot systems (deploy multiple instances)
  • ✅ Development/testing (faster iteration)

Limitations

  • Slightly slower inference (+11%) due to dequantization overhead
  • Minimal accuracy loss (~1%, negligible for most tasks)
  • Requires BitsAndBytes library (CUDA-only, not CPU)
  • Not compatible with pure ONNX export (use PyTorch)

Citation

If you use this model, please cite:

@misc{smo-clean-pens_4bit_20260316_164236,
  author = {ksj0202},
  title = {NF4 4-bit Quantized SmolVLM for Robot Control},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ksj0202/smo-clean-pens_4bit_20260316_164236}},
}

Original model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct

License

Apache 2.0 (same as base model)

Contact

For issues or questions, please open an issue on the repository.


Note: This is a quantized model. For full precision version, see HuggingFaceTB/SmolVLM2-500M-Video-Instruct.
