Model Card: RohitUltimate/Qwen3.5_VL_2B_12k
Description
This model is a fine-tuned vision-language model based on Qwen3.5-2B, optimized for image-text-to-text tasks with extended context length (12k tokens).
Compared to the base and standard fine-tuned variants, this model demonstrates improved performance on instruction-following and multimodal understanding, benefiting from higher-quality training data and better alignment for bank statement extraction.
It is designed to run efficiently on GPUs with under 8 GB of VRAM (the model weights take less than 5 GB), enabling low-cost deployment without significant performance compromise.
vLLM Inference Pipeline
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. You can run this model using vLLM with the following Docker command:
docker run --gpus all -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=<YOUR_HF_TOKEN> \
  vllm/vllm-openai:latest \
  --model RohitUltimate/Qwen3.5_VL_2B_12k \
  --tokenizer Qwen/Qwen3.5-2B \
  --dtype bfloat16 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --max-model-len 12000
Note that the Hugging Face token is passed to the container as the HUGGING_FACE_HUB_TOKEN environment variable (there is no --huggingface_token server flag); everything after the image name is an argument to the vLLM server itself.
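Once the container is up, vLLM serves an OpenAI-compatible API (by default at http://localhost:8000/v1). As a minimal sketch, the snippet below builds the JSON body for a multimodal /v1/chat/completions request that pairs a base64-encoded image with a text prompt; the prompt text and the choice of sending the image as a data URL are illustrative assumptions, and actually POSTing the body (e.g. with the `openai` client or `requests`) is left to the caller.

```python
import base64
import json


def build_vision_request(image_bytes: bytes, prompt: str,
                         model: str = "RohitUltimate/Qwen3.5_VL_2B_12k") -> str:
    """Return the JSON body for an OpenAI-style multimodal chat request."""
    # Embed the image inline as a base64 data URL, which vLLM's
    # OpenAI-compatible endpoint accepts in the image_url content part.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 512,
    }
    return json.dumps(payload)


# Hypothetical usage: image_bytes would normally come from open(path, "rb").read()
body = build_vision_request(
    b"\x89PNG-placeholder-bytes",
    "Extract all transactions from this bank statement.",
)
```

The resulting `body` string can be POSTed to http://localhost:8000/v1/chat/completions with a `Content-Type: application/json` header.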
Discussion:
If you need more information, have suggestions, or face any issues while using this model, feel free to start a discussion.
Let’s collaborate and grow this community together.