# Qwen2-VL-2B Pagoda Fine-tuning with LoRA

Fine-tuning the Qwen2-VL-2B-Instruct multimodal vision-language model on the Pagoda dataset using LoRA (Low-Rank Adaptation).
## Project Overview
This project demonstrates efficient fine-tuning of a state-of-the-art multimodal model using:
- Model: Qwen2-VL-2B-Instruct
- Dataset: Pagoda Text-and-Image Dataset
- Method: LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning
- Quantization: 4-bit NF4 quantization for memory efficiency
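The 4-bit NF4 setup is typically expressed through `BitsAndBytesConfig` and passed to `from_pretrained`. A minimal sketch (the exact flag values are an assumption, not the project's recorded config):

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 weights with bf16 compute; double quantization shaves a bit
# more memory off the quantization constants. Values here are illustrative.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
# Then: Qwen2VLForConditionalGeneration.from_pretrained(..., quantization_config=bnb_config)
```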
## Features
- **Multimodal Training:** Fine-tunes both the vision encoder and the language decoder
- **Memory Efficient:** Uses 4-bit quantization and LoRA adapters
- **Production Ready:** Complete training pipeline with evaluation and testing
- **Well Documented:** Comprehensive model card and training logs
- **Easy to Use:** Single Jupyter notebook containing all code
## Project Structure
```
qwen2-vl-pagoda-lora/
├── qwen3_vl_lora_finetuning.ipynb   # Main training notebook
├── README.md                        # This file
├── LICENSE                          # Apache 2.0 License
├── .gitignore                       # Git ignore file
└── qwen3-vl-2b-pagoda-lora/         # Output directory (after training)
    ├── adapter_config.json          # LoRA configuration
    ├── adapter_model.safetensors    # LoRA weights
    ├── README.md                    # Model card
    └── ...                          # Other model files
```
## Hardware Requirements
**Minimum (with 4-bit quantization):**
- GPU: 8GB+ VRAM (e.g., RTX 3060, T4)
- RAM: 16GB+
- Storage: 20GB free space
**Recommended (this project):**
- GPU: NVIDIA H200 (140.4 GB VRAM)
- CPU: Intel Xeon Platinum 8568Y+ (96 cores)
- RAM: 387 GB
- Storage: NVMe SSD (3.2TB)
## Training Details
| Parameter | Value |
|---|---|
| Base Model | Qwen2-VL-2B-Instruct |
| Dataset | 1000 samples from Pagoda dataset |
| Train/Val Split | 900 / 100 (90% / 10%) |
| LoRA Rank | 8 |
| LoRA Alpha | 16 |
| Batch Size | 1 (effective: 8 with gradient accumulation) |
| Learning Rate | 2e-4 |
| Epochs | 1 |
| Optimizer | PagedAdamW 8-bit |
| Precision | bfloat16 |
| Training Time | ~15-20 minutes |
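The ~113 training steps follow directly from the split and the effective batch size; a quick sanity check using only the numbers in the table above:

```python
import math

train_samples = 900      # 90% of the 1000-sample subset
per_device_batch = 1
grad_accum = 8           # gradient accumulation steps
epochs = 1

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, total_steps)  # 8 113
```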
## Installation
### Prerequisites
- Python 3.8+
- CUDA 11.8+ (for GPU training)
- Git
### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/YOUR_USERNAME/qwen2-vl-pagoda-lora.git
   cd qwen2-vl-pagoda-lora
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   Or install manually:

   ```bash
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
   pip install transformers accelerate datasets peft bitsandbytes pillow qwen-vl-utils
   ```

3. Log in to HuggingFace:

   ```bash
   huggingface-cli login
   ```
## Usage
### Training

1. Open the notebook:

   ```bash
   jupyter notebook qwen3_vl_lora_finetuning.ipynb
   ```

2. Run all cells sequentially to:
   - Install dependencies
   - Load the dataset
   - Configure LoRA
   - Train the model
   - Evaluate and test
   - Upload to HuggingFace Hub
### Inference

After training, use the fine-tuned model:
```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from PIL import Image

# Load the base model and attach the LoRA adapters
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "./qwen3-vl-2b-pagoda-lora")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", trust_remote_code=True)

# Build a chat-style prompt around the image
image = Image.open("your_image.jpg")
conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image."},
    ],
}]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate, then decode only the newly produced tokens (skip the prompt echo)
output = model.generate(**inputs, max_new_tokens=256)
generated = output[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```
## Results

### Training Metrics
- Training Loss: (to be updated after training)
- Validation Loss: (to be updated after training)
- Training Steps: ~113 steps
### Memory Usage
- GPU VRAM: ~6.7 GB / 140.4 GB
- RAM: ~3 GB / 387 GB
- Model Size: ~400 MB (LoRA adapters only)
## Key Optimizations
- **4-bit Quantization:** Reduces model weight memory by ~75%
- **LoRA Adapters:** Trains only ~1% of the parameters
- **Gradient Checkpointing:** Reduces activation memory
- **Image Resizing:** Limits images to 280×280 px to reduce the number of vision tokens
- **Text Truncation:** Limits text to 50 characters for memory efficiency
- **Batch Size 1:** Minimal memory footprint, with gradient accumulation for a larger effective batch
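The "~1% of parameters" figure comes from LoRA's low-rank factorization: each adapted weight of shape `d_out x d_in` is frozen, and only the two small adapter matrices A (`r x d_in`) and B (`d_out x r`) are trained. A back-of-the-envelope check for a single layer (the dimensions below are illustrative, not the model's actual shapes):

```python
# Trainable fraction for one LoRA-adapted linear layer.
d_in, d_out, r = 2048, 2048, 8   # hypothetical layer shape, LoRA rank 8

base_params = d_in * d_out               # frozen full-rank weight
lora_params = r * d_in + d_out * r       # adapter matrices A and B

fraction = lora_params / base_params
print(f"{fraction:.2%}")  # 0.78%
```
The fraction grows linearly with the rank, which is why r=8 keeps the adapter checkpoint in the hundreds of megabytes rather than gigabytes.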
## Links
- Base Model: Qwen/Qwen2-VL-2B-Instruct
- Dataset: nojiyoon/pagoda-text-and-image-dataset-small
- Fine-tuned Model: (link to your HuggingFace model)
- PEFT Library: huggingface/peft
## License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
The base model (Qwen2-VL-2B-Instruct) is also licensed under Apache 2.0.
## Acknowledgments
- Qwen Team for the excellent Qwen2-VL-2B-Instruct model
- HuggingFace for the Transformers and PEFT libraries
- nojiyoon for the Pagoda dataset
- Vast.ai for providing H200 GPU infrastructure
## Citation
If you use this project in your research, please cite:
```bibtex
@misc{qwen2vl-pagoda-lora,
  author       = {Your Name},
  title        = {Fine-tuning Qwen2-VL-2B on Pagoda Dataset with LoRA},
  year         = {2025},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/YOUR_USERNAME/qwen2-vl-pagoda-lora}}
}
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Contact
For questions or issues, please open an issue on GitHub or contact [your email].
**Note:** This is a demonstration project for educational purposes. The model was trained on a limited dataset (1,000 samples) for a single epoch. For production use, consider training on more data for multiple epochs.