FirstSight: Distilled Qwen2-VL-2B for Efficient Egocentric QA

Model Description

FirstSight is a knowledge-distilled vision-language model optimized for efficient egocentric question answering on edge devices. It is distilled from Qwen2-VL-7B-Instruct via logit-based knowledge distillation, achieving 3.75× compression with minimal performance degradation.

Key Highlights

  • 🚀 5.16× faster inference than the teacher model
  • 💾 67.9% VRAM reduction (9.39 GB savings)
  • 📦 73.4% smaller model size (2.21B vs 8.29B parameters)
  • ⚡ Optimized for edge deployment on resource-constrained devices
  • 🎯 Specialized for egocentric scenarios (first-person perspective)

Model Architecture

  • Base Model: Qwen2-VL-2B-Instruct
  • Teacher Model: Qwen2-VL-7B-Instruct
  • Student Parameters: 2.21B
  • Precision: BFloat16 mixed precision
  • Distillation Method: Logit-based knowledge distillation with KL divergence

Training Details

Training Data

  • Dataset: Synthetic egocentric QA dataset with 5,000 training samples (an illustrative record format is sketched after this list)
  • Validation Set: 1,000 samples
  • Question Types: Object recognition, spatial reasoning, action understanding, temporal queries, environment understanding
  • Scenarios: Kitchen, living room, office, outdoor, workshop
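
The exact record schema is not published with this card. Assuming one image, one question, and one short answer per sample, a record covering the fields above might look like the following; all field names and values here are hypothetical:

# Hypothetical training record; field names are illustrative, not the actual schema.
sample = {
    "image": "frames/kitchen_0042.jpg",        # first-person frame
    "scenario": "kitchen",                     # one of the five scenarios listed above
    "question_type": "object_recognition",     # or spatial, action, temporal, environment
    "question": "What object am I holding in my right hand?",
    "answer": "A wooden spoon."
}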

Training Procedure

  • Framework: PyTorch with Hugging Face Transformers
  • Epochs: 10
  • Batch Size: 2 per GPU with 4× gradient accumulation
  • Learning Rate: 1e-5 (AdamW optimizer)
  • Scheduler: Cosine annealing with 100 warmup steps
  • Loss Function: Weighted combination of distillation loss (α = 0.7) and hard-label cross-entropy loss (1 - α = 0.3); see the loss sketch after this list
  • Temperature: 2.0 for knowledge distillation
  • Hardware: NVIDIA Quadro RTX 8000 (48GB)
  • Training Time: ~4 hours
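
The per-token objective combines a temperature-scaled KL term against the teacher's logits with a standard cross-entropy term against the ground-truth tokens. A minimal sketch of this loss, assuming the α and temperature values above (the project's actual training code may differ in details such as masking and reduction):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.7, temperature=2.0):
    # Soft-target term: KL divergence between temperature-scaled distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard-label term: cross-entropy against the ground-truth answer tokens,
    # ignoring prompt/padding positions marked with -100.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    return alpha * kd + (1.0 - alpha) * ce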

Training Hyperparameters

{
    "learning_rate": 1e-5,
    "optimizer": "AdamW",
    "weight_decay": 0.01,
    "gradient_accumulation_steps": 4,
    "max_grad_norm": 1.0,
    "warmup_steps": 100,
    "temperature": 2.0,
    "alpha": 0.7,
    "epochs": 10
}
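
A sketch of how these hyperparameters map onto the optimizer, cosine schedule, and 4-step gradient accumulation (illustrative; `model`, `dataloader`, `compute_loss`, and `num_training_steps` are placeholders, not names from the project code):

import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=num_training_steps
)

accum_steps = 4
for step, batch in enumerate(dataloader):
    loss = compute_loss(batch) / accum_steps   # scale so accumulated gradients average out
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()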

Performance Metrics

Inference Speed

Metric                   Teacher (7B)   Student (2B)   Improvement
Avg Latency              1.260 s        0.244 s        5.16×
Throughput (samples/s)   0.79           4.10           5.16×
Throughput (tokens/s)    34.33          162.55         5.16×
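
The benchmarking script is not included in this card. A rough way to reproduce per-sample latency and throughput for greedy decoding (illustrative only, not the exact protocol behind the numbers above):

import time
import torch

def benchmark(model, inputs, runs=20, max_new_tokens=128):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / runs            # seconds per sample
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]  # tokens generated per sample
    return {"avg_latency_s": latency,
            "samples_per_s": 1.0 / latency,
            "tokens_per_s": new_tokens / latency}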

Memory Usage

Metric       Teacher (7B)    Student (2B)    Savings
Model Size   8.29B params    2.21B params    73.4%
Peak VRAM    13.81 GB        4.43 GB         9.39 GB (67.9% reduction)
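
Peak VRAM figures of this kind are typically read from the CUDA allocator after a full generation pass; a minimal sketch, assuming a loaded model and prepared inputs on the current GPU:

import torch

torch.cuda.reset_peak_memory_stats()
_ = model.generate(**inputs, max_new_tokens=128)
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_gb:.2f} GB")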

Model Compression

  • Compression Ratio: 3.75× (see the snippet after this list)
  • Parameter Reduction: 73.4%
  • From: 8.29B parameters
  • To: 2.21B parameters
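
The compression ratio and parameter reduction follow directly from the two parameter counts. With both models loaded (here `teacher_model` and `student_model` are placeholder names), they can be reproduced as:

teacher_params = sum(p.numel() for p in teacher_model.parameters())   # ~8.29B
student_params = sum(p.numel() for p in student_model.parameters())   # ~2.21B
print(f"Compression: {teacher_params / student_params:.2f}x, "
      f"{1 - student_params / teacher_params:.1%} fewer parameters")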

Usage

Installation

pip install transformers torch pillow qwen-vl-utils

Inference Example

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model_name = "YOUR_USERNAME/firstsight-qwen2-vl-2b-distilled"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Prepare image and question
image = Image.open("egocentric_image.jpg")
question = "What object am I holding in my right hand?"

# Create conversation template
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": f"Question: {question}\nAnswer concisely:"}
        ]
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = inputs.to(model.device)

# Generate answer
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False
    )

# Decode response
response = processor.batch_decode(
    outputs[:, inputs['input_ids'].shape[1]:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(f"Answer: {response}")

Intended Use

Primary Use Cases

  • Egocentric Visual Question Answering: Answer questions about first-person perspective images/videos
  • Edge Device Deployment: Run VLM inference on resource-constrained hardware (mobile, IoT, AR/VR)
  • Real-time Assistive Systems: Power low-latency visual assistants for wearable cameras
  • Smart Glasses Applications: Enable efficient VLM capabilities on AR/VR headsets

Supported Question Types

  1. Object Recognition: "What object did I just pick up?"
  2. Spatial Reasoning: "Where is the nearest door?"
  3. Action Understanding: "What action am I performing?"
  4. Temporal Queries: "What was I looking at 5 seconds ago?"
  5. Environment Understanding: "What room am I in?"
  6. Counting: "How many items are on the table?"
  7. Attribute Recognition: "What color is the object I'm holding?"

Limitations

  • Model is specialized for egocentric scenarios and may perform worse on third-person images
  • Trained on synthetic data; real-world performance may vary
  • No additional multimodal pretraining; the student was adapted solely through knowledge distillation from the teacher
  • May inherit biases from the teacher model (Qwen2-VL-7B)
  • Limited to short-form QA; not optimized for long conversations

Ethical Considerations

  • Privacy: Egocentric images often contain sensitive personal information. Ensure proper consent and data protection.
  • Bias: Model may exhibit biases from training data and teacher model. Evaluate on diverse datasets.
  • Misuse: Could be used for unauthorized surveillance. Deploy responsibly with user consent.

Citation

If you use this model in your research, please cite:

@misc{firstsight2024,
  title={FirstSight: Efficient Knowledge Distillation for Vision-Language Models on Edge Devices},
  author={NYU HPML Project Team},
  year={2024},
  howpublished={\url{https://huggingface.co/YOUR_USERNAME/firstsight-qwen2-vl-2b-distilled}},
  note={Distilled from Qwen2-VL-7B-Instruct for egocentric question answering}
}

Model Card Authors

NYU High Performance Machine Learning (HPML) Project Team

Model Card Contact

For questions or feedback, please open an issue on the GitHub repository.


Training Date: December 8-9, 2024
Evaluation Date: December 9, 2025
Framework: PyTorch 2.3.0, Transformers 4.57.3, BitsAndBytes 0.48.2
