Etropyyy, nielsr (HF Staff) committed · Commit b21ee6f · 1 Parent(s): 1e14253

Add model card for WSVD (#1)

- Add model card for WSVD (f5166f6b639ad31403d82fd8761ca9d53b7781fe)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1): README.md (+35 −0)
README.md ADDED
@@ -0,0 +1,35 @@
---
pipeline_tag: image-text-to-text
---

# WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models

WSVD is a weighted low-rank approximation method for fast and efficient execution of low-precision Vision-Language Models (VLMs). By applying SVD at a finer granularity (per attention head) and using element-wise importance to guide fine-tuning, WSVD achieves significant decoding speedups while maintaining high accuracy.
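
To give a rough feel for the "element-wise importance" idea, the sketch below uses a common relaxation of weighted low-rank approximation (which has no closed form): fold the square root of row-summed importance scores into the weight matrix, take a truncated SVD, then fold the scaling back out. This is a minimal illustration under stated assumptions, not the official WSVD implementation; the function name, shapes, and random Fisher scores are all made up:

```python
import numpy as np

def fisher_weighted_svd(W, fisher, rank):
    """Low-rank factorization of W guided by per-element importance.

    Illustrative sketch: element-wise weighted low-rank approximation has
    no closed form, so we use the row-scaling relaxation. `fisher` has the
    same shape as W and holds (hypothetical) importance scores.
    """
    # Row importance: sum element-wise scores over each row.
    row_imp = np.sqrt(fisher.sum(axis=1, keepdims=True)) + 1e-8
    U, S, Vt = np.linalg.svd(row_imp * W, full_matrices=False)
    # Truncate to the target rank and undo the row scaling.
    A = (U[:, :rank] * S[:rank]) / row_imp   # (out_dim, rank)
    B = Vt[:rank]                            # (rank, in_dim)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
fisher = rng.random((64, 64))   # stand-in importance scores
A, B = fisher_weighted_svd(W, fisher, rank=16)
print(A.shape, B.shape)  # (64, 16) (16, 64)
```

The product `A @ B` then replaces `W`, so rows the importance scores mark as critical are approximated more faithfully than unimportant ones.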

## Resources

- **Paper:** [WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models](https://huggingface.co/papers/2604.02570)
- **GitHub Repository:** [SAI-Lab-NYU/WSVD](https://github.com/SAI-Lab-NYU/WSVD)
- **Project Page:** [OpenReview](https://openreview.net/forum?id=zrmQ4koOw9)

## 🌟 Highlights

- **🧩 Per-head SVD that actually speeds up decoding:** WSVD applies SVD per attention head to avoid the "shared-latent reloading" overhead that can make conventional SVD slower at decode time, cutting KV-cache memory traffic and reconstruction cost.
- **🎯 Accuracy-preserving compression (Fisher-weighted local FT + local QAT):** WSVD uses element-wise importance to guide local fine-tuning of the low-rank factors, then adds quantization-aware training with outlier handling, yielding a low-precision, low-rank VLM with minimal accuracy drop.
- **📊 System-level Triton fusion with Flash Decoding for real latency wins:** WSVD integrates the low-rank reconstruction directly into the fused flash-decoding kernel, translating rank reduction into practical speedups (over 1.8× decoding speedup vs. Flash Decoding).
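
The per-head granularity in the first highlight can be sketched as follows: instead of one SVD over the whole projection matrix (which gives every head a shared latent that must be reloaded at decode time), each head's slice is factorized independently. This is a minimal NumPy illustration, not the official Triton kernel; `num_heads`, `head_dim`, and `rank` are hypothetical:

```python
import numpy as np

def per_head_truncated_svd(W_k, num_heads, rank):
    """Factorize a projection matrix head by head.

    W_k maps hidden states (d_model) to concatenated head outputs
    (num_heads * head_dim). Each head's (head_dim x d_model) slice gets
    its own truncated SVD, keeping the low-rank factors local to the head.
    """
    d_out, d_model = W_k.shape
    head_dim = d_out // num_heads
    factors = []
    for h in range(num_heads):
        Wh = W_k[h * head_dim:(h + 1) * head_dim]   # (head_dim, d_model)
        U, S, Vt = np.linalg.svd(Wh, full_matrices=False)
        A = U[:, :rank] * S[:rank]                  # (head_dim, rank)
        B = Vt[:rank]                               # (rank, d_model)
        factors.append((A, B))
    return factors

rng = np.random.default_rng(0)
d_model, num_heads, head_dim, rank = 128, 4, 32, 8
W_k = rng.standard_normal((num_heads * head_dim, d_model))
factors = per_head_truncated_svd(W_k, num_heads, rank)
# Stack the per-head reconstructions back into the full projection shape.
W_approx = np.vstack([A @ B for A, B in factors])
print(W_approx.shape)  # (128, 128)
```

At decode time, each head can then apply its own small `B` and `A` factors without touching any other head's latent, which is what makes the rank reduction translate into lower memory traffic.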

## Supported Models

The implementation currently supports **LLaVA-v1.5** and **LLaVA-Next** models. Pre-computed calibration cache files for LLaVA-1.5 (7B, 13B) and LLaVA-Next (7B, 13B) are available in the official repository to facilitate reproduction of the results.

## Citation

```bibtex
@inproceedings{wsvd2026iclr,
  title={WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models},
  author={...},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```