VIVID Docmatix - Multilingual Vision-Language Model

This model is a fine-tuned version of Qwen/Qwen3-VL-4B-Instruct on the VIVID Docmatix multilingual dataset.

Model Details

  • Base Model: Qwen/Qwen3-VL-4B-Instruct
  • Training Method: LoRA (Low-Rank Adaptation); see the adapter-loading sketch after this list
  • Languages: English (en), Kannada (kn), Hindi (hi)
  • Training Dataset: VIVID Docmatix (10k samples)
  • Checkpoint: checkpoint-50
  • Experiment: test_e2e/gemma3-4b_10k_r16a32_bs2
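
If the repository ships the LoRA adapter weights rather than fully merged weights, one way to use them is to attach the adapter to the base model with PEFT. The snippet below is a minimal sketch, assuming the repository contains a standard PEFT adapter_config.json and that the base model ID matches the one listed above; the full inference example is in the Usage section.

from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel
import torch

# Load the base model named in Model Details (assumption: this ID is correct and public)
base = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-4B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the LoRA adapter weights from this repository
model = PeftModel.from_pretrained(base, "v1v1d1/vivid_docmatix_gemma3_4b_en_kn_hi_10k_50")

# Optionally fold the adapter into the base weights for faster inference
model = model.merge_and_unload()

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")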

Training Configuration

The model was fine-tuned with the following settings (a rough PEFT equivalent is sketched after the list):

  • LoRA Rank: 16 (LoRA alpha 32, as encoded in the run name r16a32)
  • Batch Size: 2
  • Max Length: 16384 tokens
  • Gradient Accumulation Steps: 4
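
As a rough illustration of how these settings map onto code, the sketch below reconstructs an equivalent PEFT and Transformers configuration. This is an assumption-laden approximation, not the exact MS-Swift invocation used for training; in particular the target modules are guessed, and the 16384-token limit is enforced at tokenization time rather than through TrainingArguments.

from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings implied by the run name r16a32 (rank 16, alpha 32)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption: attention projections only
    task_type="CAUSAL_LM",
)

# Trainer settings matching the values listed above
training_args = TrainingArguments(
    output_dir="output",              # checkpoints such as checkpoint-50 are written here
    per_device_train_batch_size=2,    # Batch Size: 2
    gradient_accumulation_steps=4,    # Gradient Accumulation Steps: 4
    bf16=True,                        # matches the BF16 weights of the released model
)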

Intended Use

This model is designed for:

  • Multilingual document understanding (English, Kannada, Hindi)
  • OCR and text extraction from images
  • Visual question answering on documents
  • Document layout analysis

Usage

from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load model and processor (the Auto classes resolve the correct architecture for the checkpoint)
model = AutoModelForImageTextToText.from_pretrained(
    "v1v1d1/vivid_docmatix_gemma3_4b_en_kn_hi_10k_50",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("v1v1d1/vivid_docmatix_gemma3_4b_en_kn_hi_10k_50")

# Prepare inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "Extract all text from this document in Kannada."},
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=4096)
    generated_ids = [
        out_ids[len(in_ids):]
        for in_ids, out_ids in zip(inputs.input_ids, output_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]

print(output_text)

Training Details

This model was trained as part of the Nayana project for multilingual multimodal AI, focusing on underrepresented languages.

Framework

  • Training Framework: MS-Swift
  • Base Framework: PyTorch + Transformers

Limitations

  • Primarily trained on document images
  • Best performance on English, Kannada, and Hindi
  • May not generalize well to other languages or domains

Citation

@misc{nayana-vivid-docmatix,
  title={VIVID Docmatix: Multilingual Vision-Language Model},
  author={Nayana Team},
  year={2026},
  url={https://huggingface.co/v1v1d1/vivid_docmatix_gemma3_4b_en_kn_hi_10k_50}
}

License

Apache 2.0


Model trained using MS-Swift on Modal
