Model Card: RohitUltimate/Qwen3.5_VL_2B_12k

Description

This model is a fine-tuned vision-language model based on Qwen3.5-2B, optimized for image-text-to-text tasks with extended context length (12k tokens).

Compared to the base and standard fine-tuned variants, this model demonstrates improved performance on instruction-following and multimodal understanding, benefiting from higher-quality training data and better alignment for bank statement extraction.

It is designed to run efficiently on GPUs with less than 8 GB of VRAM, with model weights under 5 GB, enabling low-cost deployment without a significant performance compromise.
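As a rough sanity check on the memory figures above, a 2B-parameter model stored in bfloat16 (2 bytes per parameter) works out to just under 4 GiB of weights; the numbers below are an estimate, not a measurement:

```python
# Rough weight-memory estimate for a 2B-parameter model in bfloat16.
params = 2_000_000_000
bytes_per_param = 2  # bfloat16 = 16 bits
weight_gib = params * bytes_per_param / 1024**3
# ~3.7 GiB of weights, consistent with the "under 5 GB" figure above.
# KV cache and activations add to this at inference time, which is why
# an 8 GB GPU is the practical floor rather than 4 GB.
```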


vLLM Inference Pipeline

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. You can run this model using vLLM with the following Docker command:

docker run --gpus all -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=<YOUR_HF_TOKEN> \
  vllm/vllm-openai:latest \
  --model RohitUltimate/Qwen3.5_VL_2B_12k \
  --tokenizer Qwen/Qwen3.5-2B \
  --dtype bfloat16 \
  --trust-remote-code \
  --gpu-memory-utilization 0.9 \
  --max-model-len 12000
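Once the server is up, it exposes vLLM's OpenAI-compatible API at /v1/chat/completions. The sketch below builds a multimodal chat request; the image URL and prompt are placeholders, and sending the request requires the running server from the command above:

```python
import json

# OpenAI-compatible chat request for the vLLM server started above.
# The image URL and prompt text are illustrative placeholders.
payload = {
    "model": "RohitUltimate/Qwen3.5_VL_2B_12k",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/statement.png"}},
                {"type": "text",
                 "text": "Extract all transactions from this bank statement as JSON."},
            ],
        }
    ],
    "max_tokens": 1024,
}

body = json.dumps(payload)
# Send with any HTTP client against the running container, e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions",
#                 json=payload).json()
```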

Discussion:

If you need more information, have suggestions, or face any issues while using this model, feel free to start a discussion.

Let's collaborate and grow this community together.

Downloads last month: 305
Model size: 2B params (Safetensors)
Tensor types: F32, BF16

Model tree for RohitUltimate/Qwen3.5_VL_2B_12k: fine-tuned from Qwen/Qwen3.5-2B