PaliGemma 2 LoRA Adapter for Multi-Modal Hateful Content Classification

🎯 Model Overview

This is a LoRA (Low-Rank Adaptation) adapter fine-tuned on top of google/paligemma2-3b-pt-224 for multi-label hateful content detection on paired text + image data using the MMHS150K dataset.

✨ Key Features

  • Multi-Modal Understanding: Processes both text and images simultaneously for context-aware classification
  • Multi-Label Classification: Can detect multiple types of hate speech in a single sample
  • Generative Approach: Uses generative classification instead of traditional classification heads
  • Efficient Fine-Tuning: LoRA adapter whose trainable weights total only ~24 MB on disk
  • JSON Output: Generates structured JSON arrays for easy downstream processing

This model uses generative classification: instead of training a dedicated classification head, the model generates a strict JSON array of labels (e.g., ["racist", "sexist"]).
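The prompt side of this approach can be sketched as a small helper (a hypothetical function mirroring the "return JSON only" prompt style used in the inference examples in this card):

```python
# Hypothetical helper that builds the strict "return JSON only" prompt.
def build_prompt(text, class_names):
    return (
        f"Classify the following text and image into zero or more of these "
        f"labels: {class_names}. Return ONLY a JSON array of applicable labels. "
        f"Text: {text}"
    )

prompt = build_prompt("example post", ["racist", "sexist"])
```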

Model Details

Model Description

Given an image and its associated text, the model outputs a JSON array containing zero or more labels from a fixed label set. The model is trained to classify hateful memes and social media content into multiple hate speech categories.

High-level flow:

  1. Build a strict "return JSON only" prompt listing allowed labels.
  2. Feed (text + image) to the VLM.
  3. Generate a short response.
  4. Parse the first JSON array found (best-effort JSON extraction).
  5. Convert labels into multi-hot predictions and compute multi-label metrics.
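
Step 5 can be sketched in a few lines of pure Python (illustrative names; the actual pipeline may differ):

```python
# Convert a parsed label list into a multi-hot vector over the fixed label set.
CLASS_NAMES = ["racist", "sexist", "homophobe", "religion", "otherhate"]

def labels_to_multihot(labels, class_names=CLASS_NAMES):
    """Map predicted label strings to a 0/1 vector; unknown labels are ignored."""
    label_set = set(labels)
    return [1 if name in label_set else 0 for name in class_names]

print(labels_to_multihot(["racist", "sexist"]))  # -> [1, 1, 0, 0, 0]
```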

| Property | Value |
|---|---|
| Developed by | Amirhossein Yousefi |
| Model type | Vision-Language Model (VLM) with LoRA adapter |
| Language(s) | English |
| License | MIT |
| Base model | google/paligemma2-3b-pt-224 |
| Parameters (base) | 3B |
| Adapter size | ~24 MB |
| Input | Text + image (224 × 224) |
| Output | JSON array of hate speech labels |

🏷️ Label Classes

The model classifies content into the following 5 hate speech categories:

| Label | Description | Examples |
|---|---|---|
| racist | Content with racial discrimination | Slurs, stereotypes, dehumanization based on race/ethnicity |
| sexist | Content with gender-based discrimination | Misogyny, gender stereotypes, harassment based on gender |
| homophobe | Content with anti-LGBTQ+ discrimination | Slurs, stereotypes targeting LGBTQ+ individuals |
| religion | Content with religious discrimination | Attacks on religious groups, religious stereotypes |
| otherhate | Other forms of hateful content | Hate not covered by the above categories |

Uses

✅ Direct Use

This model is intended for detecting and classifying hateful content in multimodal (text + image) social media posts, memes, and similar content. It can be used for:

  • Content moderation systems - Automated flagging of potentially harmful content
  • Research on hate speech detection - Academic studies on multi-modal hate speech
  • Social media analysis - Understanding patterns of hateful content
  • Dataset annotation assistance - Semi-automated labeling of hate speech datasets
  • Educational purposes - Understanding how VLMs can be applied to content moderation

⚠️ Out-of-Scope Use

  • Production moderation without human review: This model should not be the sole decision-maker for content removal.
  • Non-English content: The model is trained on English data only.
  • Single-modality analysis: Best results are achieved with both text and image inputs.
  • Real-time high-stakes decisions: The model may produce errors and should not be used for legal or high-stakes decisions without human oversight.
  • Surveillance or censorship: This model should not be used for mass surveillance or unjust censorship.

Bias, Risks, and Limitations

Known Limitations

  • Dataset Bias: The model is trained on the MMHS150K dataset, which may contain biases present in the original annotations.
  • Cultural Context: Performance may vary across different types of hateful content and cultural contexts.
  • Error Rate: The model may produce false positives/negatives and should be used with human oversight.
  • JSON Parsing: Generated JSON output may occasionally be malformed and require robust parsing.
  • Temporal Bias: The model may not recognize new slurs, memes, or evolving hate speech patterns.
  • Image Quality: Performance may degrade on low-quality, distorted, or heavily edited images.

Recommendations

  • ✅ Always use human review for critical content moderation decisions.
  • ✅ Validate model outputs against your specific use case before deployment.
  • ✅ Consider the cultural and contextual limitations of the training data.
  • ✅ Implement robust JSON parsing with fallback mechanisms.
  • ✅ Regularly evaluate model performance on new data distributions.
  • ✅ Combine with other moderation signals for production systems.

🚀 How to Get Started with the Model

Installation

pip install transformers peft torch pillow accelerate

Quick Start - Load the Model

from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel
import torch

# Model identifiers
BASE_MODEL = "google/paligemma2-3b-pt-224"
LORA_ADAPTER = "Amirhossein75/paligemma2-3b-mmhs150k-lora"

# Load the base model
base_model = AutoModelForImageTextToText.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto"  # or "cpu" for CPU-only inference
)

# Load the LoRA adapter
model = PeftModel.from_pretrained(base_model, LORA_ADAPTER)

# Load the processor
processor = AutoProcessor.from_pretrained(BASE_MODEL)

print("✅ Model loaded successfully!")

Full Inference Example

import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

# Load base model and adapter
BASE_MODEL = "google/paligemma2-3b-pt-224"
LORA_ADAPTER = "Amirhossein75/paligemma2-3b-mmhs150k-lora"

processor = AutoProcessor.from_pretrained(BASE_MODEL)
base_model = AutoModelForImageTextToText.from_pretrained(
    BASE_MODEL,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, LORA_ADAPTER)

# Prepare input
image = Image.open("path/to/image.jpg").convert("RGB")
text = "Some text to analyze"

# Create prompt
class_names = ["racist", "sexist", "homophobe", "religion", "otherhate"]
prompt = f"Classify the following text and image into zero or more of these labels: {class_names}. Return ONLY a JSON array of applicable labels. Text: {text}"

# Generate
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens (the output also echoes the prompt)
generated = outputs[0][inputs["input_ids"].shape[-1]:]
result = processor.decode(generated, skip_special_tokens=True)
print(result)  # e.g., ["racist", "sexist"]

Note: This is a LoRA adapter and requires loading the base model first. You cannot use AutoModel.from_pretrained() directly on the adapter.

Batch Inference

import json
import re

def parse_json_labels(response: str) -> list:
    """Extract the first JSON array of strings from a model response, with fallback."""
    match = re.search(r'\[.*?\]', response, re.DOTALL)
    if match:
        try:
            labels = json.loads(match.group())
            if isinstance(labels, list):
                return [label for label in labels if isinstance(label, str)]
        except json.JSONDecodeError:
            pass
    return []

def classify_batch(model, processor, images, texts, class_names):
    """Classify a batch of image-text pairs."""
    results = []
    for image, text in zip(images, texts):
        prompt = f"Classify the following text and image into zero or more of these labels: {class_names}. Return ONLY a JSON array of applicable labels. Text: {text}"
        inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=64)
        # Decode only the newly generated tokens (the output also echoes the prompt)
        generated = outputs[0][inputs["input_ids"].shape[-1]:]
        response = processor.decode(generated, skip_special_tokens=True)
        results.append(parse_json_labels(response))
    return results

Training Details

📊 Training Data

MMHS150K (Multi-Modal Hate Speech) - A large-scale dataset for multi-modal hate speech detection containing ~150K tweets with associated images.

| Split | Samples | Description |
|---|---|---|
| Train | ~135,000 | Training samples |
| Validation | 5,000 | Validation samples |
| Test | ~10,000 | Held-out test samples |

Dataset structure:

  • train.csv, val.csv, test.csv with columns: text, image_path, labels
  • Labels are multi-hot encoded for: racist, sexist, homophobe, religion, otherhate

Data Source: Twitter/X posts with associated images, annotated for hate speech categories.
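
Assuming the labels column stores a JSON array of label strings (the exact encoding in the CSVs may differ), a split can be read with the standard library alone:

```python
import csv
import io
import json

# Stand-in for open("train.csv"); this row content is made up for illustration.
sample_csv = io.StringIO(
    'text,image_path,labels\n'
    '"example tweet text",images/123.jpg,"[""racist""]"\n'
)

rows = []
for row in csv.DictReader(sample_csv):
    row["labels"] = json.loads(row["labels"])  # e.g. ["racist"]
    rows.append(row)
```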

Training Procedure

🖥️ Hardware Used

| Component | Specification |
|---|---|
| GPU | NVIDIA A100 (40GB/80GB HBM2e) |
| Platform | Google Colab Pro |
| GPU Memory | 40GB+ |
| Precision | bf16 (bfloat16) mixed precision |
| CUDA Version | 11.8+ |

Note: The NVIDIA A100 is a data center GPU based on the Ampere architecture, offering 40GB or 80GB of HBM2e memory with roughly 1.6–2 TB/s of bandwidth depending on the variant. It is well suited to large VLM fine-tuning tasks.

βš™οΈ Training Hyperparameters

Parameter Value
Training regime bf16 mixed precision
Optimizer AdamW
Learning rate 2e-4
Batch size 4 (with gradient accumulation)
Epochs 1
Max sequence length 512
Warmup steps 100
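
As a hedged sketch, these hyperparameters map onto 🤗 TrainingArguments roughly as follows (output_dir and gradient_accumulation_steps are illustrative assumptions, not values taken from this card):

```python
from transformers import TrainingArguments

# Illustrative configuration only; the exact arguments used in training
# may differ from this sketch.
training_args = TrainingArguments(
    output_dir="paligemma2-mmhs150k-lora",  # hypothetical output path
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,          # assumption: not stated above
    num_train_epochs=1,
    warmup_steps=100,
    bf16=True,
    optim="adamw_torch",
)
```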

🔧 LoRA Configuration

| Parameter | Value |
|---|---|
| LoRA rank (r) | 4 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Task type | CAUSAL_LM |
| Bias | none |
| Adapter size | ~24 MB |
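
The table above corresponds to a peft LoraConfig along these lines (a sketch; the exact config file used for training may differ):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=4,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
# Applied via get_peft_model(base_model, lora_config);
# peft_model.print_trainable_parameters() then reports trainable vs. total params.
```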

⏱️ Training Time & Throughput

| Metric | Value |
|---|---|
| Validation time | 458.13s (0:07:38) |
| Validation throughput | 10.914 samples/s |
| Epochs completed | 1.0 |
| Final validation loss | 0.3525 |

📈 Evaluation

Testing Data, Factors & Metrics

Testing Data

| Dataset | Samples | Description |
|---|---|---|
| Validation set | 5,000 | MMHS150K validation split |
| Test set | ~10,000 | MMHS150K test split |

Metrics Explained

| Metric | Description | Interpretation |
|---|---|---|
| F1 Micro | Micro-averaged F1 score across all labels | Higher is better. Gives equal weight to every individual label decision. |
| F1 Macro | Macro-averaged F1 score (unweighted mean over classes) | Higher is better. Gives equal weight to each class. |
| Subset Accuracy | Exact-match accuracy | Higher is better. All labels must match exactly. |
| Hamming Loss | Fraction of incorrectly predicted labels | Lower is better. Measures per-label errors. |
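
For reference, all four metrics can be computed from multi-hot vectors without external dependencies (a minimal sketch; scikit-learn's f1_score and hamming_loss are the usual choice):

```python
# Pure-Python multi-label metrics for lists of equal-length 0/1 vectors.
def multilabel_metrics(y_true, y_pred):
    n, k = len(y_true), len(y_true[0])
    exact = sum(t == p for t, p in zip(y_true, y_pred))
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    tp, fp, fn = [0] * k, [0] * k, [0] * k
    for t, p in zip(y_true, y_pred):
        for j in range(k):
            tp[j] += t[j] == 1 and p[j] == 1
            fp[j] += t[j] == 0 and p[j] == 1
            fn[j] += t[j] == 1 and p[j] == 0

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    return {
        "f1_micro": f1(sum(tp), sum(fp), sum(fn)),
        "f1_macro": sum(f1(tp[j], fp[j], fn[j]) for j in range(k)) / k,
        "subset_accuracy": exact / n,
        "hamming_loss": wrong / (n * k),
    }
```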

📊 Results

This Model's Performance

| Split | F1 Micro | F1 Macro | Subset Accuracy | Hamming Loss |
|---|---|---|---|---|
| Validation | 0.5378 | 0.5000 | 0.4338 | 0.1422 |
| Test | 0.5404 | 0.4896 | – | – |

Comparison with Other Models in the Project

| Model | Hardware | Split | F1 Micro | F1 Macro | Subset Acc | Hamming Loss |
|---|---|---|---|---|---|---|
| Qwen2-VL 2B + LoRA | RTX 3080 (16GB) | Validation | 0.6172 | 0.5077 | 0.4366 | 0.14276 |
| PaliGemma 2 3B + LoRA (this model) | A100 | Validation | 0.5378 | 0.5000 | 0.4338 | 0.14220 |
| Qwen2-VL 2B + LoRA | RTX 3080 (16GB) | Test | 0.6110 | 0.4992 | – | – |
| PaliGemma 2 3B + LoRA (this model) | A100 | Test | 0.5404 | 0.4896 | – | – |

Note: The Qwen2-VL model was trained on a local Windows machine with NVIDIA GeForce RTX 3080 Laptop GPU (16GB VRAM), NVIDIA driver 581.57, and CUDA 13.0.

🔧 Technical Specifications

Model Architecture and Objective

| Component | Description |
|---|---|
| Base Model | PaliGemma 2 (3B parameters), a vision-language model by Google |
| Architecture | Transformer-based VLM with SigLIP vision encoder |
| Vision Encoder | SigLIP-So400m/14 |
| Text Decoder | Gemma 2 (2B) |
| Image Resolution | 224 × 224 pixels |
| Adapter | LoRA (Low-Rank Adaptation) |
| Objective | Generative multi-label classification via JSON array generation |

Compute Infrastructure

Hardware

| Component | Training | Inference (Recommended) |
|---|---|---|
| GPU | NVIDIA A100 (40GB) | Any GPU with 8GB+ VRAM |
| Platform | Google Colab Pro | Local / Cloud |
| Precision | bf16 | fp16 / bf16 |
| Memory | 40GB+ GPU RAM | 8GB+ GPU RAM |

Software

| Package | Version |
|---|---|
| Python | 3.8+ |
| Transformers | 4.40+ |
| PEFT | 0.17.1 |
| PyTorch | 2.0+ |
| Accelerate | 0.27+ |
| Pillow | 9.0+ |

📚 Citation

If you use this model, please cite:

BibTeX:

@misc{yousefi2024paligemma-hatespeech,
  author = {Yousefi, Amirhossein},
  title = {Multi-Modal Vision-Language Models for Hateful Content Classification},
  year = {2024},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/amirhossein-yousefi/text_image_multi_modal_vlm}},
  note = {PaliGemma 2 LoRA adapter for MMHS150K hate speech detection}
}

APA:

Yousefi, A. (2024). Multi-Modal Vision-Language Models for Hateful Content Classification. GitHub. https://github.com/amirhossein-yousefi/text_image_multi_modal_vlm

📖 More Information

For more details on training, evaluation, and usage, see the GitHub repository.

👤 Model Card Authors

Amirhossein Yousefi
