# Qwen2-VL-2B Pagoda Fine-tuning with LoRA

Fine-tuning the Qwen2-VL-2B-Instruct multimodal vision-language model on the Pagoda dataset using LoRA (Low-Rank Adaptation).
## Project Overview
This project demonstrates efficient fine-tuning of a state-of-the-art multimodal model using:
- Model: Qwen2-VL-2B-Instruct
- Dataset: Pagoda Text-and-Image Dataset
- Method: LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning
- Quantization: 4-bit NF4 quantization for memory efficiency
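The 4-bit NF4 setup is typically expressed through `BitsAndBytesConfig` and passed to `from_pretrained`. A minimal sketch (the exact flag values are an assumption, not the project's recorded config):

```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 weights with bf16 compute; double quantization shaves a bit
# more memory off the quantization constants. Values here are illustrative.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
# Then: Qwen2VLForConditionalGeneration.from_pretrained(..., quantization_config=bnb_config)
```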
## Features
- **Multimodal Training:** Fine-tunes both the vision encoder and the language decoder
- **Memory Efficient:** Uses 4-bit quantization and LoRA adapters
- **Production Ready:** Complete training pipeline with evaluation and testing
- **Well Documented:** Comprehensive model card and training logs
- **Easy to Use:** Single Jupyter notebook containing all code
## Project Structure
```
qwen2-vl-pagoda-lora/
├── qwen3_vl_lora_finetuning.ipynb   # Main training notebook
├── README.md                        # This file
├── LICENSE                          # Apache 2.0 License
├── .gitignore                       # Git ignore file
└── qwen3-vl-2b-pagoda-lora/         # Output directory (after training)
    ├── adapter_config.json          # LoRA configuration
    ├── adapter_model.safetensors    # LoRA weights
    ├── README.md                    # Model card
    └── ...                          # Other model files
```
## Hardware Requirements
**Minimum (with 4-bit quantization):**
- GPU: 8GB+ VRAM (e.g., RTX 3060, T4)
- RAM: 16GB+
- Storage: 20GB free space
**Recommended (this project):**
- GPU: NVIDIA H200 (140.4 GB VRAM)
- CPU: Intel Xeon Platinum 8568Y+ (96 cores)
- RAM: 387 GB
- Storage: NVMe SSD (3.2TB)
## Training Details
| Parameter | Value |
|---|---|
| Base Model | Qwen2-VL-2B-Instruct |
| Dataset | 1000 samples from Pagoda dataset |
| Train/Val Split | 900 / 100 (90% / 10%) |
| LoRA Rank | 8 |
| LoRA Alpha | 16 |
| Batch Size | 1 (effective: 8 with gradient accumulation) |
| Learning Rate | 2e-4 |
| Epochs | 1 |
| Optimizer | PagedAdamW 8-bit |
| Precision | bfloat16 |
| Training Time | ~15-20 minutes |
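The ~113 training steps follow directly from the split and the effective batch size; a quick sanity check using only the numbers in the table above:

```python
import math

train_samples = 900      # 90% of the 1000-sample subset
per_device_batch = 1
grad_accum = 8           # gradient accumulation steps
epochs = 1

effective_batch = per_device_batch * grad_accum
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, total_steps)  # 8 113
```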
## Installation
### Prerequisites
- Python 3.8+
- CUDA 11.8+ (for GPU training)
- Git
### Setup

1. Clone the repository:

   ```bash
   git clone https://github.com/YOUR_USERNAME/qwen2-vl-pagoda-lora.git
   cd qwen2-vl-pagoda-lora
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   Or install manually:

   ```bash
   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
   pip install transformers accelerate datasets peft bitsandbytes pillow qwen-vl-utils
   ```

3. Log in to HuggingFace:

   ```bash
   huggingface-cli login
   ```
## Usage
### Training

1. Open the notebook:

   ```bash
   jupyter notebook qwen3_vl_lora_finetuning.ipynb
   ```

2. Run all cells sequentially to:
   - Install dependencies
   - Load the dataset
   - Configure LoRA
   - Train the model
   - Evaluate and test
   - Upload to HuggingFace Hub
### Inference

After training, use the fine-tuned model:
```python
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from PIL import Image

# Load the base model and attach the LoRA adapters
base_model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "./qwen3-vl-2b-pagoda-lora")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", trust_remote_code=True)

# Build a chat-style prompt around the image
image = Image.open("your_image.jpg")
conversation = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Describe this image."},
    ],
}]
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Generate, then decode only the newly produced tokens (skip the prompt echo)
output = model.generate(**inputs, max_new_tokens=256)
generated = output[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```
## Results

### Training Metrics
- Training Loss: (to be updated after training)
- Validation Loss: (to be updated after training)
- Training Steps: ~113 steps
### Memory Usage
- GPU VRAM: ~6.7 GB / 140.4 GB
- RAM: ~3 GB / 387 GB
- Model Size: ~400 MB (LoRA adapters only)
## Key Optimizations
- **4-bit Quantization:** Reduces model weight memory by ~75%
- **LoRA Adapters:** Trains only ~1% of the parameters
- **Gradient Checkpointing:** Reduces activation memory
- **Image Resizing:** Limits images to 280×280 px to reduce the number of vision tokens
- **Text Truncation:** Limits text to 50 characters for memory efficiency
- **Batch Size 1:** Minimal memory footprint, with gradient accumulation for a larger effective batch
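The "~1% of parameters" figure comes from LoRA's low-rank factorization: each adapted weight of shape `d_out x d_in` is frozen, and only the two small adapter matrices A (`r x d_in`) and B (`d_out x r`) are trained. A back-of-the-envelope check for a single layer (the dimensions below are illustrative, not the model's actual shapes):

```python
# Trainable fraction for one LoRA-adapted linear layer.
d_in, d_out, r = 2048, 2048, 8   # hypothetical layer shape, LoRA rank 8

base_params = d_in * d_out               # frozen full-rank weight
lora_params = r * d_in + d_out * r       # adapter matrices A and B

fraction = lora_params / base_params
print(f"{fraction:.2%}")  # 0.78%
```
The fraction grows linearly with the rank, which is why r=8 keeps the adapter checkpoint in the hundreds of megabytes rather than gigabytes.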
## Links
- Base Model: Qwen/Qwen2-VL-2B-Instruct
- Dataset: nojiyoon/pagoda-text-and-image-dataset-small
- Fine-tuned Model: (link to your HuggingFace model)
- PEFT Library: huggingface/peft
## License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
The base model (Qwen2-VL-2B-Instruct) is also licensed under Apache 2.0.
## Acknowledgments
- Qwen Team for the excellent Qwen2-VL-2B-Instruct model
- HuggingFace for the Transformers and PEFT libraries
- nojiyoon for the Pagoda dataset
- Vast.ai for providing H200 GPU infrastructure
## Citation
If you use this project in your research, please cite:
```bibtex
@misc{qwen2vl-pagoda-lora,
  author       = {Your Name},
  title        = {Fine-tuning Qwen2-VL-2B on Pagoda Dataset with LoRA},
  year         = {2025},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/YOUR_USERNAME/qwen2-vl-pagoda-lora}}
}
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Contact
For questions or issues, please open an issue on GitHub or contact [your email].
**Note:** This is a demonstration project for educational purposes. The model was trained on a limited dataset (1,000 samples) for a single epoch. For production use, consider training on more data for multiple epochs.