Tingman and nielsr (HF Staff) committed on
Commit d1657c1 · verified · 1 Parent(s): 89719c2

Enhance model card with pipeline tag, project page, and comprehensive details (#1)

- Enhance model card with pipeline tag, project page, and comprehensive details (ffc485eaf9312d06574cf8b259d7986383f7d893)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1): README.md (+57 -6)
README.md CHANGED
---
license: gpl-3.0
pipeline_tag: image-to-image
---

# MatchAttention: Matching the Relative Positions for High-Resolution Cross-View Matching

[📄 Paper](https://arxiv.org/abs/2510.14260) - [🌐 Project Page](https://tingmanyan.github.io/MatchAttention/) - [🤗 Demo](https://huggingface.co/spaces/Tingman/MatchStereo) - [💻 Code](https://github.com/TingmanYan/MatchAttention)

MatchAttention is a contiguous and differentiable sliding-window attention mechanism that enables long-range connections, explicit matching, and linear complexity. Applied to stereo matching and optical flow, it achieves real-time, state-of-the-art performance.

## Introduction

This paper proposes MatchAttention, an attention mechanism that dynamically matches relative positions. Given a query, the relative position determines the attention sampling center of the key-value pairs; continuous and differentiable sliding-window attention sampling is achieved by the proposed BilinearSoftmax. The relative positions are embedded into the feature channels and iteratively updated through residual connections across layers.

Since the relative position is exactly the learning target for cross-view matching, an efficient hierarchical cross-view decoder, MatchDecoder, is designed with MatchAttention as its core component. To handle cross-view occlusions, gated cross-MatchAttention and a consistency-constrained loss are proposed. These two components jointly mitigate the impact of occlusions in both the forward and backward passes, allowing the model to focus on learning matching relationships.

When applied to stereo matching, MatchStereo-B ranked 1st in average error on the public Middlebury benchmark and requires only 29 ms for KITTI-resolution inference. MatchStereo-T can process 4K UHD images in 0.1 seconds using only 3 GB of GPU memory. The proposed models also achieve state-of-the-art performance on the KITTI 2012, KITTI 2015, ETH3D, and Spring flow datasets. The combination of high accuracy and low computational complexity makes real-time, high-resolution, high-accuracy cross-view matching practical.
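
The sampling idea can be illustrated with a toy 1-D NumPy sketch (this is not the paper's implementation, which operates on 2-D feature maps in PyTorch; names and the 1-D simplification are ours): each query `i` attends to a small window of keys/values centered at the continuous position `i + rel_pos[i]`, with linear (the 1-D analogue of bilinear) interpolation making the sampling differentiable in the relative position.

```python
import numpy as np

def bilinear_softmax_attention_1d(q, k, v, rel_pos, window=3):
    """Toy 1-D sketch of sliding-window attention with a continuous,
    per-query sampling center (illustrative only, not the paper's code)."""
    n, d = q.shape
    out = np.zeros_like(v)
    half = window // 2
    for i in range(n):
        center = i + rel_pos[i]          # continuous sampling center
        base = int(np.floor(center))
        frac = center - base             # interpolation weight in [0, 1)
        scores, vals = [], []
        for o in range(-half, half + 1):
            j0 = int(np.clip(base + o, 0, n - 1))
            j1 = int(np.clip(base + o + 1, 0, n - 1))
            # linear interpolation between neighboring keys/values (clamped)
            k_s = (1 - frac) * k[j0] + frac * k[j1]
            v_s = (1 - frac) * v[j0] + frac * v[j1]
            scores.append(float(q[i] @ k_s) / np.sqrt(d))
            vals.append(v_s)
        w = np.exp(np.array(scores) - max(scores))  # stable softmax over window
        w /= w.sum()
        out[i] = w @ np.stack(vals)
    return out
```

Because the sampled keys/values depend smoothly on `rel_pos`, gradients can flow back into the relative positions, which is what lets them be updated across layers.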
## Key Features

* **Efficient and Scalable**: Linear complexity enables high-resolution processing with low GPU memory; MatchStereo-T processes 4K UHD images in 0.1 seconds using only 3 GB of GPU memory.
* **State-of-the-Art Performance**: Ranked 1st in average error on the Middlebury stereo benchmark, with state-of-the-art results on the KITTI 2012, KITTI 2015, ETH3D, and Spring flow datasets.
* **Explainable Occlusion Handling**: Gated cross-MatchAttention and a consistency-constrained loss mitigate the impact of occlusions, letting the model focus on learning robust matching relationships.

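
The consistency idea can be sketched with a standard left-right disparity check (an illustrative stand-in for occlusion detection in general, not the paper's exact consistency-constrained loss):

```python
import numpy as np

def left_right_consistency(disp_left, disp_right, thresh=1.0):
    """Flag left-view pixels whose disparity disagrees with the
    corresponding right-view disparity by more than `thresh`;
    inconsistent pixels are typically treated as occluded."""
    h, w = disp_left.shape
    xs = np.broadcast_to(np.arange(w), (h, w))
    # a left pixel at column x maps to column x - d in the right view
    x_right = np.clip(np.rint(xs - disp_left).astype(int), 0, w - 1)
    disp_right_warped = np.take_along_axis(disp_right, x_right, axis=1)
    return np.abs(disp_left - disp_right_warped) <= thresh  # True = consistent
```

Masking the loss with such a map keeps occluded pixels from dominating training, which is the role the gated attention and consistency loss play in MatchDecoder.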
## Model Weights

The following pre-trained model weights are available:

| Model | Params | Resolution | FLOPs | GPU Mem | Latency | Checkpoint |
| :---- | :----- | :--------- | :---- | :------ | :------ | :--------- |
| MatchStereo-T | 8.78M | 1536x1536 | 0.34T | 1.45G | 38ms | [Hugging Face](https://huggingface.co/Tingman/MatchAttention/blob/main/matchstereo_tiny_fsd.pth) |
| MatchStereo-S | 25.2M | 1536x1536 | 0.98T | 1.73G | 45ms | [Hugging Face](https://huggingface.co/Tingman/MatchAttention/blob/main/matchstereo_small_fsd.pth) |
| MatchStereo-B | 75.5M | 1536x1536 | 3.59T | 2.94G | 75ms | [Hugging Face](https://huggingface.co/Tingman/MatchAttention/blob/main/matchstereo_base_fsd.pth) |
| MatchFlow-B | 75.5M | 1536x1536 | 3.60T | 3.22G | 77ms | [Hugging Face](https://huggingface.co/Tingman/MatchAttention/blob/main/matchflow_base_sintel.pth) |

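
The checkpoints above can be fetched directly from the Hub; a minimal helper that builds the direct-download URLs from the table (filenames are taken from the table above; actually loading the weights additionally requires PyTorch and the repository code):

```python
REPO_ID = "Tingman/MatchAttention"

# checkpoint filenames as listed in the table above
CHECKPOINTS = {
    "MatchStereo-T": "matchstereo_tiny_fsd.pth",
    "MatchStereo-S": "matchstereo_small_fsd.pth",
    "MatchStereo-B": "matchstereo_base_fsd.pth",
    "MatchFlow-B": "matchflow_base_sintel.pth",
}

def checkpoint_url(model: str) -> str:
    """Direct download URL for a listed checkpoint on the Hugging Face Hub."""
    return f"https://huggingface.co/{REPO_ID}/resolve/main/{CHECKPOINTS[model]}"
```

With `huggingface_hub` installed, the same files can also be retrieved via `hf_hub_download(repo_id=REPO_ID, filename=CHECKPOINTS["MatchStereo-T"])`.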
## Sample Usage

To get started with inference using the provided scripts, follow the instructions in the [GitHub repository](https://github.com/TingmanYan/MatchAttention). Here is an example of running stereo matching on custom images from the command line:

```bash
# Clone the repository and install dependencies first, as per the GitHub README,
# then run from the MatchAttention directory.

# Run stereo matching on custom images
python run_img.py --img0_dir images/left/ --img1_dir images/right/ --output_path outputs --checkpoint_path checkpoints/matchstereo_tiny_fsd.pth --no_compile
```

For other tasks such as optical flow, benchmark evaluation, or the local Gradio demo, refer to the [GitHub repository](https://github.com/TingmanYan/MatchAttention).

## Citation

If you find our work helpful, please cite our paper:

```bibtex
@article{yan2025matchattention,
  title={MatchAttention: Matching the Relative Positions for High-Resolution Cross-View Matching},
  author={Tingman Yan and Tao Liu and Xilian Yang and Qunfei Zhao and Zeyang Xia},
  journal={arXiv preprint arXiv:2510.14260},
  year={2025}
}
```

## Acknowledgement

We would like to thank the authors of [UniMatch](https://github.com/autonomousvision/unimatch), [RAFT-Stereo](https://github.com/princeton-vl/RAFT-Stereo), [MetaFormer](https://github.com/sail-sg/metaformer), and [TransNeXt](https://github.com/DaiShiResearch/TransNeXt) for releasing their code, and the authors of [FoundationStereo](https://github.com/NVlabs/FoundationStereo) for releasing the FSD dataset.

## Contact

Please reach out to [Tingman Yan](mailto:tingmanyan@dlut.edu.cn) with any questions.