| --- |
| pipeline_tag: image-text-to-text |
| --- |
| |
| # WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models |
|
|
| WSVD is a weighted low-rank approximation method for fast, efficient execution of low-precision vision-language models (VLMs). It applies SVD at a finer granularity (per attention head) and uses element-wise importance to guide fine-tuning of the low-rank factors, which yields significant decoding speedups while maintaining high accuracy. |
|
|
| ## Resources |
|
|
| - **Paper:** [WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models](https://huggingface.co/papers/2604.02570) |
| - **GitHub Repository:** [SAI-Lab-NYU/WSVD](https://github.com/SAI-Lab-NYU/WSVD) |
| - **Project Page:** [OpenReview](https://openreview.net/forum?id=zrmQ4koOw9) |
|
|
| ## 🌟 Highlights |
|
|
| - **🧩 Per-head SVD to actually speed up decoding:** WSVD applies SVD per attention head to avoid the “shared-latent reloading” overhead that can make conventional SVD slower at decode time, cutting KV-cache memory traffic and reconstruction cost. |
| - **🎯 Accuracy-preserving compression: Fisher-weighted local fine-tuning (FT) + local quantization-aware training (QAT):** WSVD uses element-wise importance to guide local fine-tuning of the low-rank factors, then adds quantization-aware training with outlier handling, yielding a low-precision low-rank VLM with minimal accuracy drop. |
| - **📊 System-level Triton fusion with Flash Decoding for real latency wins:** WSVD integrates low-rank reconstruction directly into the flash-decoding fused kernel, translating rank reduction into practical speedups (over 1.8× decoding speedup vs. Flash Decoding). |
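
The first two highlights can be illustrated with a small NumPy sketch. This is not the repository's implementation: the function names (`per_head_truncated_svd`, `weighted_error`), the matrix shapes, and the random matrix standing in for element-wise importance are all assumptions made for illustration. It only shows what factoring a projection matrix one attention head at a time looks like, and how an element-wise weight can enter a reconstruction error that a fine-tuning step might then minimize.

```python
import numpy as np


def per_head_truncated_svd(weight, num_heads, rank):
    """Factor each head's column slice of `weight` into two rank-`rank` factors.

    weight: (d_model, num_heads * head_dim) projection matrix (hypothetical layout).
    Returns `down` with shape (num_heads, d_model, rank)
    and `up` with shape (num_heads, rank, head_dim).
    """
    d_model, d_out = weight.shape
    assert d_out % num_heads == 0, "output dim must split evenly across heads"
    head_dim = d_out // num_heads
    assert 1 <= rank <= min(d_model, head_dim), "rank exceeds the slice's max rank"

    down, up = [], []
    for h in range(num_heads):
        w_h = weight[:, h * head_dim:(h + 1) * head_dim]
        u, s, vt = np.linalg.svd(w_h, full_matrices=False)
        down.append(u[:, :rank] * s[:rank])  # fold singular values into the left factor
        up.append(vt[:rank])
    return np.stack(down), np.stack(up)


def weighted_error(weight, down, up, importance):
    """Element-wise weighted squared reconstruction error over all heads."""
    num_heads = up.shape[0]
    approx = np.concatenate([down[h] @ up[h] for h in range(num_heads)], axis=1)
    return float(np.sum(importance * (weight - approx) ** 2))


# Toy example with made-up sizes; `importance` is a random stand-in,
# not real Fisher information.
rng = np.random.default_rng(0)
d_model, num_heads, head_dim, rank = 64, 4, 16, 8
W = rng.standard_normal((d_model, num_heads * head_dim))
importance = rng.random(W.shape)

down, up = per_head_truncated_svd(W, num_heads, rank)
print("factor shapes:", down.shape, up.shape)
print("weighted error:", weighted_error(W, down, up, importance))
```

Plain truncated SVD minimizes the unweighted error, so the weighted error it leaves behind is generally not optimal. That gap is the kind of thing importance-guided fine-tuning of the factors could reduce, though the sketch does not attempt that step, the quantization-aware training, or the fused Triton kernel.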
|
|
| ## Supported Models |
|
|
| The implementation currently supports **LLaVA-v1.5** and **LLaVA-Next** models. Pre-computed calibration cache files for LLaVA-1.5 (7B, 13B) and LLaVA-Next (7B, 13B) are available in the official repository to facilitate the reproduction of results. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @inproceedings{ |
| wsvd2026iclr, |
| title={WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models}, |
| author={...}, |
| booktitle={International Conference on Learning Representations (ICLR)}, |
| year={2026} |
| } |
| ``` |