Add model card for WSVD
#1
by nielsr HF Staff - opened
README.md
ADDED
@@ -0,0 +1,35 @@
---
pipeline_tag: image-text-to-text
---

# WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models

WSVD is a weighted low-rank approximation method for fast, efficient execution of low-precision Vision-Language Models (VLMs). By applying SVD at per-head granularity and using element-wise importance to guide fine-tuning, WSVD achieves significant decoding speedups while preserving accuracy.

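To make the core idea concrete, below is a minimal numpy sketch of importance-weighted low-rank approximation. It uses the common Fisher-weighted-SVD heuristic of scaling each row of the weight matrix by the square root of its aggregate importance before a truncated SVD, then unscaling the left factor; the function name, shapes, and exact weighting scheme are illustrative assumptions, not the official WSVD implementation (see the paper for the precise formulation).

```python
import numpy as np

def weighted_low_rank(W, importance, rank):
    """Importance-weighted rank-`rank` approximation of W (out_dim x in_dim).

    Heuristic sketch: aggregate element-wise importance into a per-row
    weight, scale rows before a truncated SVD, then undo the scaling.
    (Illustrative; the exact weighting WSVD uses is an assumption here.)
    """
    # Per-row weight from element-wise importance (all entries > 0 assumed).
    row_w = np.sqrt(importance.sum(axis=1, keepdims=True))  # (out_dim, 1)
    U, S, Vt = np.linalg.svd(row_w * W, full_matrices=False)
    # Keep top-`rank` components, fold singular values into the left factor,
    # and undo the row scaling so that A @ B approximates W.
    A = (U[:, :rank] * S[:rank]) / row_w   # (out_dim, rank)
    B = Vt[:rank]                          # (rank, in_dim)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
imp = rng.uniform(0.1, 1.0, size=W.shape)  # synthetic element-wise importance
A, B = weighted_low_rank(W, imp, rank=16)
# Relative error measured under the same importance weighting.
err = np.linalg.norm(imp * (W - A @ B)) / np.linalg.norm(imp * W)
print(f"weighted relative error at rank 16: {err:.3f}")
```

The row-scaling trick turns the weighted problem into a plain SVD, which is why it is a popular approximation for Fisher-weighted compression of linear layers.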
## Resources

- **Paper:** [WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models](https://huggingface.co/papers/2604.02570)
- **GitHub Repository:** [SAI-Lab-NYU/WSVD](https://github.com/SAI-Lab-NYU/WSVD)
- **Project Page:** [OpenReview](https://openreview.net/forum?id=zrmQ4koOw9)

## 🌟 Highlights

- **🧩 Per-head SVD that actually speeds up decoding:** WSVD applies SVD per attention head, avoiding the "shared-latent reloading" overhead that can make conventional SVD slower at decode time, and cutting both KV-cache memory traffic and reconstruction cost.
- **🎯 Accuracy-preserving compression (Fisher-weighted local FT + local QAT):** WSVD uses element-wise importance to guide local fine-tuning of the low-rank factors, then adds quantization-aware training with outlier handling, yielding a low-precision, low-rank VLM with minimal accuracy drop.
- **📊 System-level Triton fusion with Flash Decoding for real latency wins:** WSVD integrates low-rank reconstruction directly into the fused flash-decoding kernel, translating rank reduction into practical speedups (over 1.8× decoding speedup vs. Flash Decoding).

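The first highlight's per-head factorization can be sketched as follows: each head's slice of the value projection gets its own small low-rank factors, so at decode time a head touches only its own latent instead of reloading a latent shared across all heads. All names, shapes, and the head count below are illustrative assumptions, not the official WSVD code.

```python
import numpy as np

def per_head_factors(W_v, num_heads, rank):
    """Factor a value projection per attention head (illustrative sketch).

    W_v: (hidden, hidden) weight. Each head's slice (head_dim, hidden) is
    given independent rank-`rank` factors, so the per-token cached latent
    for a head is only `rank` floats rather than a full shared latent.
    """
    hidden = W_v.shape[0]
    head_dim = hidden // num_heads
    factors = []
    for h in range(num_heads):
        Wh = W_v[h * head_dim:(h + 1) * head_dim]   # (head_dim, hidden)
        U, S, Vt = np.linalg.svd(Wh, full_matrices=False)
        # Fold singular values into the left factor: Wh ≈ A_h @ B_h.
        factors.append((U[:, :rank] * S[:rank], Vt[:rank]))
    return factors

rng = np.random.default_rng(0)
W_v = rng.standard_normal((128, 128))
facs = per_head_factors(W_v, num_heads=8, rank=4)
# Per-token low-rank cache: 8 heads x rank-4 latents = 32 floats,
# versus 128 floats for the uncompressed per-token value vector.
cached = sum(B.shape[0] for _, B in facs)
print(f"low-rank cache per token: {cached} floats (full: {W_v.shape[0]})")
```

Because each head's `B_h @ x` latent is self-contained, the decode kernel never re-reads another head's latent, which is the memory-traffic saving the highlight refers to.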
## Supported Models

The implementation currently supports **LLaVA-v1.5** and **LLaVA-Next**. Pre-computed calibration cache files for LLaVA-1.5 (7B, 13B) and LLaVA-Next (7B, 13B) are available in the official repository to facilitate reproducing the results.

## Citation

```bibtex
@inproceedings{wsvd2026iclr,
  title={WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models},
  author={...},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```