---
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- ocr
- vision-language
- qwen2-vl
- vila
- multimodal
license: apache-2.0
---

# Easy DeepOCR - VILA-Qwen2-VL-8B
|
|
A vision-language model fine-tuned for OCR tasks, based on the VILA architecture with Qwen2-VL-8B as the language backbone.
|
|
## Model Description


This model combines the following components; a rough wiring sketch follows the list:
- **Language Model**: Qwen2-VL-8B
- **Vision Encoders**: SAM + CLIP
- **Architecture**: VILA
- **Task**: Optical Character Recognition (OCR)
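
For intuition, the sketch below shows the usual VILA-style wiring of these pieces: features from the SAM/CLIP vision encoders are mapped into the language model's embedding space by the multimodal projector and concatenated with the text embeddings before generation. Class names, attribute names, and dimensions are illustrative stand-ins, not the actual modules shipped in this repository.

```python
import torch
import torch.nn as nn


class MultimodalProjector(nn.Module):
    """Illustrative stand-in for mm_projector/: maps vision features to the LLM hidden size."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_feats)


def build_multimodal_inputs(vision_feats, text_embeds, projector):
    """Prepend projected image tokens to the text token embeddings fed to the LLM."""
    image_tokens = projector(vision_feats)
    return torch.cat([image_tokens, text_embeds], dim=1)


# Toy tensors only, to show the data flow; the dimensions are made up.
projector = MultimodalProjector(vision_dim=1024, llm_dim=3584)
vision_feats = torch.randn(1, 256, 1024)  # features from the SAM/CLIP encoders
text_embeds = torch.randn(1, 32, 3584)    # embedded prompt tokens
inputs_embeds = build_multimodal_inputs(vision_feats, text_embeds, projector)
print(inputs_embeds.shape)                # torch.Size([1, 288, 3584])
```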
|
|
## Model Structure
```
easy_deepocr/
├── config.json          # Model configuration
├── llm/                 # Qwen2-VL-8B language model weights
├── mm_projector/        # Multimodal projection layer
├── sam_clip_ckpt/       # SAM and CLIP vision encoder weights
└── trainer_state.json   # Training state information
```
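
To pull these files locally and inspect them, the standard `huggingface_hub` download works; the layout on disk should mirror the tree above.

```python
import os

from huggingface_hub import snapshot_download

# Download the full repository (config, LLM weights, mm_projector, SAM/CLIP checkpoints).
local_dir = snapshot_download(repo_id="pkulium/easy_deepocr")

# Print the top-level entries, which should match the directory tree above.
for entry in sorted(os.listdir(local_dir)):
    print(entry)
```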
|
|
## Usage
```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code is required because the VILA-style modeling code ships with the checkpoint.
model = AutoModel.from_pretrained("pkulium/easy_deepocr", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("pkulium/easy_deepocr")

# Example inference: the exact preprocessing and generation calls depend on the
# custom modeling code shipped with this repository (see the hedged sketch after
# this block).
# image = ...
# text = ...
```
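
If the remote code follows the standard `transformers` generation interface, an OCR call might look roughly like the sketch below. The `images` keyword, the prompt wording, and the preprocessing step are assumptions for illustration; the remote modeling files are the authoritative reference.

```python
from PIL import Image

# Hypothetical end-to-end call; the image keyword and prompt format are assumptions,
# not a verified API of this checkpoint.
image = Image.open("receipt.png").convert("RGB")
prompt = "Transcribe all text in this image."

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, images=[image], max_new_tokens=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```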
|
|
## Training Details


- **Base Model**: Qwen2-VL-8B
- **Vision Encoders**: SAM + CLIP
- **Training Framework**: VILA
- **Training Type**: Pretraining for OCR tasks
|
|
## Intended Use


This model is designed for the following tasks (illustrative prompts follow the list):
- Document OCR
- Scene text recognition
- Handwriting recognition
- Multi-language text extraction
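
As a rough illustration of how these tasks can be phrased as text prompts (the exact prompt template this checkpoint expects is not documented here, so treat these strings as placeholders):

```python
# Illustrative task prompts; not the checkpoint's verified prompt format.
TASK_PROMPTS = {
    "document_ocr": "Transcribe all text in this document, preserving reading order.",
    "scene_text": "Read the text visible in this photo.",
    "handwriting": "Transcribe the handwritten text in this image.",
    "multilingual": "Extract all text and keep each line in its original language.",
}
```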
|
|
## Limitations


- Performance may vary with image quality and resolution
- Best suited for the OCR-style tasks listed under Intended Use
|
|
## Citation


If you use this model, please cite:
```bibtex
@misc{easy_deepocr,
  author = {Ming Liu},
  title = {Easy DeepOCR - VILA-Qwen2-VL-8B},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/pkulium/easy_deepocr}
}
```
|
|
## Acknowledgments


- [VILA](https://github.com/NVlabs/VILA) for the architecture
- [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL) for the language model
- [SAM](https://github.com/facebookresearch/segment-anything) and [CLIP](https://github.com/openai/CLIP) for vision encoding capabilities