---
language: en
license: other
datasets:
- DeepMount00/ner_training
tags:
- vision
- multimodal
- OCR
- SmolVLM
pipeline_tag: text-generation
---

# SmolVLM Base - OCR Fine-tuned

This model is SmolVLM-Base fine-tuned for OCR, with the fine-tuned weights merged into the base model. It was trained with QLoRA on the DeepMount00/ner_training dataset.

## Model Details

- **Base Model**: HuggingFaceTB/SmolVLM-Base
- **Task**: Optical Character Recognition (OCR)
- **Training Method**: QLoRA with 4-bit quantization
- **Target Modules**: down_proj, o_proj, k_proj, q_proj, gate_proj, up_proj, v_proj

## Usage

```python
from transformers import AutoProcessor, Idefics3ForConditionalGeneration
import torch
from PIL import Image

model_id = "DeepMount00/SmolVLM-Base-ocr_base"
processor = AutoProcessor.from_pretrained(model_id)
model = Idefics3ForConditionalGeneration.from_pretrained(model_id)

# Load your image
image = Image.open("path_to_your_image.jpg").convert("RGB")

# Prepare the prompt
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "You are a model specialized in OCR"},
            {"type": "image"},
            {"type": "text", "text": "Extract the text from this image"}
        ]
    }
]

# Render the chat messages into a prompt string; the processor expects text, not a list of dicts
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Process inputs
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Generate
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)

# Decode and print the response (the decoded text includes the prompt)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
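
The usage snippet decodes `outputs[0]` in full. Because `generate` returns the prompt tokens followed by the newly generated ones, the printed text repeats the prompt before the extracted text. A minimal sketch of one way to keep only the new tokens is shown here; `drop_prompt_tokens` is a hypothetical helper name, not part of `transformers`, and the token IDs in the demonstration are made up.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def drop_prompt_tokens(output_ids: Sequence[T], prompt_length: int) -> Sequence[T]:
    """Return only the tokens that follow the first `prompt_length` positions.

    Hypothetical helper (not a transformers API). It relies only on len()
    and slicing, so it works on plain lists and on 1-D tensors alike.
    """
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return output_ids[prompt_length:]


# Toy demonstration: plain lists stand in for token IDs (values are made up).
prompt_ids = [101, 7, 8, 9]
full_output = prompt_ids + [42, 43, 44]  # prompt followed by generated tokens
new_tokens = drop_prompt_tokens(full_output, len(prompt_ids))
print(new_tokens)
```

With the usage snippet, this would be called as `drop_prompt_tokens(outputs[0], inputs["input_ids"].shape[1])`, and the result passed to `processor.decode(..., skip_special_tokens=True)`.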