Etropyyy / wsvd-cache


	---
	pipeline_tag: image-text-to-text
	---

	# WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models

	WSVD is a weighted low-rank approximation method for fast, efficient execution of low-precision vision-language models (VLMs). It applies SVD at a finer granularity (per attention head) and uses element-wise importance to guide fine-tuning, achieving over 1.8× decoding speedup relative to Flash Decoding with minimal accuracy drop.

	## Resources

	- Paper: [WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models](https://huggingface.co/papers/2604.02570)
	- GitHub Repository: [SAI-Lab-NYU/WSVD](https://github.com/SAI-Lab-NYU/WSVD)
	- Project Page: [OpenReview](https://openreview.net/forum?id=zrmQ4koOw9)

	## 🌟 Highlights

	- 🧩 Per-head SVD to actually speed up decoding: WSVD applies SVD per attention head to avoid the “shared-latent reloading” overhead that can make conventional SVD slower at decode time, cutting KV-cache memory traffic and reconstruction cost.
	- 🎯 Accuracy-preserving compression with Fisher-weighted local fine-tuning and local quantization-aware training (QAT): WSVD uses element-wise importance to guide local fine-tuning of the low-rank factors, then adds QAT with outlier handling, yielding a low-precision, low-rank VLM with minimal accuracy drop.
	- 📊 System-level Triton fusion with Flash Decoding for real latency wins: WSVD integrates low-rank reconstruction directly into the flash-decoding fused kernel, translating rank reduction into practical speedups (over 1.8× decoding speedup vs. Flash Decoding).
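The per-head idea in the first highlight can be illustrated with a small NumPy sketch. It is not the repository's implementation: it assumes "per-head SVD" means factoring each head's column block of a key projection matrix separately, and the names `truncated_svd`, `per_head_svd`, and `weighted_error`, the toy dimensions, and the random `importance` matrix (a stand-in for Fisher-style element-wise importance) are all hypothetical.

```python
import numpy as np

def truncated_svd(w, rank):
    """Return (a, b) such that a @ b is the best rank-`rank` approximation of w."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (d_in, rank), singular values folded into a
    b = vt[:rank, :]             # (rank, d_out)
    return a, b

def per_head_svd(w, num_heads, rank):
    """Factor each head's column block of w on its own (assumed meaning of per-head SVD)."""
    head_dim = w.shape[1] // num_heads
    return [
        truncated_svd(w[:, h * head_dim:(h + 1) * head_dim], rank)
        for h in range(num_heads)
    ]

def weighted_error(w, w_hat, importance):
    """Element-wise importance-weighted squared reconstruction error."""
    return float(np.sum(importance * (w - w_hat) ** 2))

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_model, num_heads, head_dim, rank = 64, 4, 16, 8
w_k = rng.standard_normal((d_model, num_heads * head_dim))
factors = per_head_svd(w_k, num_heads, rank)

# Decode-time view: each head caches its own small latent (x @ a_h) and
# rebuilds its keys with b_h, so no head needs a latent shared across heads.
x = rng.standard_normal((5, d_model))                  # 5 cached tokens
latents = [x @ a for a, _ in factors]                  # each (5, rank)
keys = np.concatenate(
    [z @ b for z, (_, b) in zip(latents, factors)], axis=1
)                                                      # (5, num_heads * head_dim)

# Reassemble the approximated weight and score it with a placeholder importance.
w_hat = np.concatenate([a @ b for a, b in factors], axis=1)
importance = rng.random(w_k.shape)                     # NOT real Fisher information
err = weighted_error(w_k, w_hat, importance)
```

In this sketch each head stores `rank` values per token instead of `head_dim`, and reconstructing one head's keys touches only that head's latent. The weighted error is only a measurement here; the fine-tuning and QAT steps that WSVD applies to the factors are not reproduced.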

	## Supported Models

	The implementation currently supports LLaVA-1.5 and LLaVA-Next models. Pre-computed calibration cache files for LLaVA-1.5 (7B, 13B) and LLaVA-Next (7B, 13B) are available in the official repository to make the results easier to reproduce.

	## Citation

	```bibtex
	@article{wang2026wsvd,
	  title={WSVD: Weighted Low-Rank Approximation for Fast and Efficient Execution of Low-Precision Vision-Language Models},
	  author={Wang, Haiyu and Wang, Yutong and Jiang, Jack and Zhang, Sai Qian},
	  journal={arXiv preprint arXiv:2604.02570},
	  year={2026}
	}
	```