Qwen2-VL-2B Pagoda Fine-tuning with LoRA

Fine-tuning the Qwen2-VL-2B-Instruct multimodal vision-language model on the Pagoda dataset using LoRA (Low-Rank Adaptation).

🎯 Project Overview

This project demonstrates efficient fine-tuning of a state-of-the-art multimodal model using LoRA adapters on a 4-bit-quantized base model.

πŸš€ Features

  • βœ… Multimodal Training: Fine-tunes both vision encoder and language decoder
  • βœ… Memory Efficient: Uses 4-bit quantization and LoRA adapters
  • βœ… Production Ready: Complete training pipeline with evaluation and testing
  • βœ… Well Documented: Comprehensive model card and training logs
  • βœ… Easy to Use: Single Jupyter notebook with all code

πŸ“ Project Structure

qwen2-vl-pagoda-lora/
β”œβ”€β”€ qwen3_vl_lora_finetuning.ipynb  # Main training notebook
β”œβ”€β”€ README.md                        # This file
β”œβ”€β”€ LICENSE                          # Apache 2.0 License
β”œβ”€β”€ .gitignore                       # Git ignore file
└── qwen3-vl-2b-pagoda-lora/        # Output directory (after training)
    β”œβ”€β”€ adapter_config.json          # LoRA configuration
    β”œβ”€β”€ adapter_model.safetensors    # LoRA weights
    β”œβ”€β”€ README.md                    # Model card
    └── ...                          # Other model files

πŸ–₯️ Hardware Requirements

Minimum (with 4-bit quantization):

  • GPU: 8GB+ VRAM (e.g., RTX 3060, T4)
  • RAM: 16GB+
  • Storage: 20GB free space

Recommended (this project):

  • GPU: NVIDIA H200 (140.4 GB VRAM)
  • CPU: Intel Xeon Platinum 8568Y+ (96 cores)
  • RAM: 387 GB
  • Storage: NVMe SSD (3.2TB)

πŸ“Š Training Details

  • Base Model: Qwen2-VL-2B-Instruct
  • Dataset: 1,000 samples from the Pagoda dataset
  • Train/Val Split: 900 / 100 (90% / 10%)
  • LoRA Rank: 8
  • LoRA Alpha: 16
  • Batch Size: 1 (effective: 8 with gradient accumulation)
  • Learning Rate: 2e-4
  • Epochs: 1
  • Optimizer: PagedAdamW 8-bit
  • Precision: bfloat16
  • Training Time: ~15-20 minutes

πŸ”§ Installation

Prerequisites

  • Python 3.8+
  • CUDA 11.8+ (for GPU training)
  • Git

Setup

  1. Clone the repository:
git clone https://github.com/YOUR_USERNAME/qwen2-vl-pagoda-lora.git
cd qwen2-vl-pagoda-lora
  2. Install dependencies:
pip install -r requirements.txt

Or install manually:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers accelerate datasets peft bitsandbytes pillow qwen-vl-utils
  3. Login to HuggingFace:
huggingface-cli login

πŸŽ“ Usage

Training

  1. Open the notebook:
jupyter notebook qwen3_vl_lora_finetuning.ipynb
  2. Run all cells sequentially to:
    • Install dependencies
    • Load the dataset
    • Configure LoRA
    • Train the model
    • Evaluate and test
    • Upload to HuggingFace Hub

Inference

After training, use the fine-tuned model:

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from PIL import Image

# Load model
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    device_map="auto",
    trust_remote_code=True
)
model = PeftModel.from_pretrained(base_model, "./qwen3-vl-2b-pagoda-lora")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", trust_remote_code=True)

# Process image
image = Image.open("your_image.jpg")
conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image."}
    ]
}]

text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[[image]], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt
generated = output[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
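
For deployment without a PEFT dependency at inference time, the adapters can be merged into the base weights. This is a sketch using the same paths as above; it downloads the full base model, so it is wrapped in a function (`merge_lora` is a name introduced here for illustration):

```python
def merge_lora(
    base_id="Qwen/Qwen2-VL-2B-Instruct",
    adapter_dir="./qwen3-vl-2b-pagoda-lora",
    out_dir="./qwen3-vl-2b-pagoda-merged",
):
    """Fold the LoRA adapters into the base weights and save a standalone model."""
    from transformers import Qwen2VLForConditionalGeneration
    from peft import PeftModel

    base = Qwen2VLForConditionalGeneration.from_pretrained(
        base_id, trust_remote_code=True
    )
    # merge_and_unload() bakes the low-rank updates into the base weights
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(out_dir)

# Usage: merge_lora()  # then load out_dir with from_pretrained, no PEFT needed
```

The merged checkpoint is full-size (~4 GB in bfloat16) rather than the ~400 MB adapter, but loads with a single `from_pretrained` call.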

πŸ“ˆ Results

Training Metrics

  • Training Loss: (to be updated after training)
  • Validation Loss: (to be updated after training)
  • Training Steps: ~113 steps

Memory Usage

  • GPU VRAM: ~6.7 GB / 140.4 GB
  • RAM: ~3 GB / 387 GB
  • Model Size: ~400 MB (LoRA adapters only)

🌟 Key Optimizations

  1. 4-bit Quantization: Reduces model memory by 75%
  2. LoRA Adapters: Only trains ~1% of parameters
  3. Gradient Checkpointing: Reduces activation memory
  4. Image Resizing: Limits to 280Γ—280px to reduce vision tokens
  5. Text Truncation: Limits to 50 characters for memory efficiency
  6. Batch Size 1: Minimal memory footprint with gradient accumulation

πŸ”— Links

πŸ“ License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

The base model (Qwen2-VL-2B-Instruct) is also licensed under Apache 2.0.

πŸ™ Acknowledgments

  • Qwen Team for the excellent Qwen2-VL-2B-Instruct model
  • HuggingFace for the Transformers and PEFT libraries
  • nojiyoon for the Pagoda dataset
  • Vast.ai for providing H200 GPU infrastructure

πŸ“š Citation

If you use this project in your research, please cite:

@misc{qwen2vl-pagoda-lora,
  author = {Your Name},
  title = {Fine-tuning Qwen2-VL-2B on Pagoda Dataset with LoRA},
  year = {2025},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/YOUR_USERNAME/qwen2-vl-pagoda-lora}}
}

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

πŸ“§ Contact

For questions or issues, please open an issue on GitHub or contact [your email].


Note: This is a demonstration project for educational purposes. The model was trained on a limited dataset (1000 samples) for 1 epoch. For production use, consider training on more data for multiple epochs.
