---
license: mit
base_model: unsloth/Qwen2-VL-2B-Instruct
tags:
- vision-language-model
- chart-qa
- qwen2-vl
- lora
- finetuned
- unsloth
datasets:
- weijiezz/chartqa_split_test
language:
- en
pipeline_tag: image-text-to-text
---

# 📊 alan-vlm — ChartQA Vision Language Model

A finetuned version of **Qwen2-VL 2B Instruct** trained to answer natural-language questions about charts and graphs. Finetuned on the [ChartQA dataset](https://huggingface.co/datasets/weijiezz/chartqa_split_test) using [Unsloth](https://github.com/unslothai/unsloth) on a free Google Colab T4 GPU.

---

## 🧠 Model Details

| | |
|---|---|
| **Base Model** | Qwen2-VL-2B-Instruct |
| **Finetuning Method** | LoRA (r=8, alpha=8) |
| **Training Data** | 2,000 chart QA pairs |
| **Training Steps** | 500 |
| **Batch Size** | 8 (2 per device × 4 gradient accumulation) |
| **Trainable Parameters** | 9,232,384 (0.42% of total) |
| **Precision** | fp16 |
| **Hardware** | Google Colab T4 (15GB VRAM) |

---

## 🚀 Quick Start

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image
import torch

# Load model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "alanjoshua2005/alan-vlm",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("alanjoshua2005/alan-vlm")

# Run inference
def ask(image_path, question):
    image = Image.open(image_path).convert("RGB")
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}]
    text_prompt = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False,
    )
    inputs = processor(
        text=text_prompt, images=image, return_tensors="pt"
    )
    # Move tensors to wherever device_map placed the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=64)
    # Decode only the newly generated tokens, not the prompt
    input_len = inputs["input_ids"].shape[1]
    return processor.decode(output[0][input_len:], skip_special_tokens=True)

# Example
answer = ask("chart.png", "What is the value of the highest bar?")
print(answer)
```

---

## 🎛️ Gradio Demo

```python
import gradio as gr
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image
import torch

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "alanjoshua2005/alan-vlm",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("alanjoshua2005/alan-vlm")

def answer_chart_question(image, question):
    if image is None or not question.strip():
        return "Please provide both an image and a question."
    image = image.convert("RGB")
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}]
    text_prompt = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False,
    )
    inputs = processor(text=text_prompt, images=image, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=64)
    input_len = inputs["input_ids"].shape[1]
    return processor.decode(output[0][input_len:], skip_special_tokens=True)

gr.Interface(
    fn=answer_chart_question,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="📊 ChartQA - alan-vlm",
).launch()
```

---

## 📦 Dataset

Trained on [weijiezz/chartqa_split_test](https://huggingface.co/datasets/weijiezz/chartqa_split_test) — a 2,000-row dataset of chart images paired with questions and answers.
The dataset contains two types of questions:

- `human_test` — questions written by human annotators
- `augmented_test` — questions generated via data augmentation

---

## 🏋️ Training Details

Training was done with **Unsloth** for memory-efficient LoRA finetuning:

```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-2B-Instruct",
    load_in_4bit=True,
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=8,
    lora_alpha=8,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```

---

## ⚠️ Limitations

- Trained on only 2,000 samples — a learning/experimental project
- May struggle with complex multi-series charts or heavily annotated graphs
- Not yet evaluated on the full ChartQA benchmark
- Best suited for questions about simple bar, pie, and line charts

---

## 🙏 Acknowledgements

- [Unsloth](https://github.com/unslothai/unsloth) for making VLM finetuning feasible on free Colab GPUs
- [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) by Alibaba Cloud
- [ChartQA dataset](https://huggingface.co/datasets/weijiezz/chartqa_split_test) by weijiezz