---
language:
- en
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- ocr
- vision-language
- qwen2-vl
- vila
- multimodal
license: apache-2.0
---

# Easy DeepOCR - VILA-Qwen2-VL-8B
|
|
A vision-language model fine-tuned for OCR tasks, based on the VILA architecture with Qwen2-VL-8B as the language backbone.
|
|
## Model Description


This model combines the following components; a rough wiring sketch follows the list:
- **Language Model**: Qwen2-VL-8B
- **Vision Encoders**: SAM + CLIP
- **Architecture**: VILA
- **Task**: Optical Character Recognition (OCR)
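
For intuition, the sketch below shows the usual VILA-style wiring of these pieces: features from the SAM/CLIP vision encoders are mapped into the language model's embedding space by the multimodal projector and concatenated with the text embeddings before generation. Class names, attribute names, and dimensions are illustrative stand-ins, not the actual modules shipped in this repository.

```python
import torch
import torch.nn as nn


class MultimodalProjector(nn.Module):
    """Illustrative stand-in for mm_projector/: maps vision features to the LLM hidden size."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_feats)


def build_multimodal_inputs(vision_feats, text_embeds, projector):
    """Prepend projected image tokens to the text token embeddings fed to the LLM."""
    image_tokens = projector(vision_feats)
    return torch.cat([image_tokens, text_embeds], dim=1)


# Toy tensors only, to show the data flow; the dimensions are made up.
projector = MultimodalProjector(vision_dim=1024, llm_dim=3584)
vision_feats = torch.randn(1, 256, 1024)  # features from the SAM/CLIP encoders
text_embeds = torch.randn(1, 32, 3584)    # embedded prompt tokens
inputs_embeds = build_multimodal_inputs(vision_feats, text_embeds, projector)
print(inputs_embeds.shape)                # torch.Size([1, 288, 3584])
```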
|
|
## Model Structure
```
easy_deepocr/
├── config.json          # Model configuration
├── llm/                 # Qwen2-VL-8B language model weights
├── mm_projector/        # Multimodal projection layer
├── sam_clip_ckpt/       # SAM and CLIP vision encoder weights
└── trainer_state.json   # Training state information
```
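
To pull these files locally and inspect them, the standard `huggingface_hub` download works; the layout on disk should mirror the tree above.

```python
import os

from huggingface_hub import snapshot_download

# Download the full repository (config, LLM weights, mm_projector, SAM/CLIP checkpoints).
local_dir = snapshot_download(repo_id="pkulium/easy_deepocr")

# Print the top-level entries, which should match the directory tree above.
for entry in sorted(os.listdir(local_dir)):
    print(entry)
```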
|
|
## Usage
```python
from transformers import AutoModel, AutoTokenizer

# trust_remote_code is required because the VILA-style modeling code ships with the checkpoint.
model = AutoModel.from_pretrained("pkulium/easy_deepocr", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("pkulium/easy_deepocr")

# Example inference: the exact preprocessing and generation calls depend on the
# custom modeling code shipped with this repository (see the hedged sketch after
# this block).
# image = ...
# text = ...
```
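
If the remote code follows the standard `transformers` generation interface, an OCR call might look roughly like the sketch below. The `images` keyword, the prompt wording, and the preprocessing step are assumptions for illustration; the remote modeling files are the authoritative reference.

```python
from PIL import Image

# Hypothetical end-to-end call; the image keyword and prompt format are assumptions,
# not a verified API of this checkpoint.
image = Image.open("receipt.png").convert("RGB")
prompt = "Transcribe all text in this image."

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, images=[image], max_new_tokens=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```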
|
|
## Training Details


- **Base Model**: Qwen2-VL-8B
- **Vision Encoders**: SAM + CLIP
- **Training Framework**: VILA
- **Training Type**: Pretraining for OCR tasks
|
|
## Intended Use


This model is designed for the following tasks (illustrative prompts follow the list):
- Document OCR
- Scene text recognition
- Handwriting recognition
- Multi-language text extraction
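
As a rough illustration of how these tasks can be phrased as text prompts (the exact prompt template this checkpoint expects is not documented here, so treat these strings as placeholders):

```python
# Illustrative task prompts; not the checkpoint's verified prompt format.
TASK_PROMPTS = {
    "document_ocr": "Transcribe all text in this document, preserving reading order.",
    "scene_text": "Read the text visible in this photo.",
    "handwriting": "Transcribe the handwritten text in this image.",
    "multilingual": "Extract all text and keep each line in its original language.",
}
```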
|
|
## Limitations


- Performance may vary with image quality and resolution
- Best suited for the OCR-style tasks listed under Intended Use
|
|
## Citation


If you use this model, please cite:
```bibtex
@misc{easy_deepocr,
  author = {Ming Liu},
  title = {Easy DeepOCR - VILA-Qwen2-VL-8B},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/pkulium/easy_deepocr}
}
```
|
|
## Acknowledgments


- [VILA](https://github.com/NVlabs/VILA) for the architecture
- [Qwen2-VL](https://github.com/QwenLM/Qwen2-VL) for the language model
- [SAM](https://github.com/facebookresearch/segment-anything) and [CLIP](https://github.com/openai/CLIP) for vision encoding capabilities