FirstSight: Distilled Qwen2-VL-2B for Efficient Egocentric QA

Model Description

FirstSight is a knowledge-distilled vision-language model optimized for efficient egocentric question answering on edge devices. It is distilled from Qwen2-VL-7B-Instruct via logit-based knowledge distillation, achieving 3.75× compression with minimal performance degradation.

Key Highlights

  • 🚀 5.16× faster inference than the teacher model
  • 💾 67.9% VRAM reduction (9.39 GB savings)
  • 📦 73.4% smaller model size (2.21B vs 8.29B parameters)
  • ⚡ Optimized for edge deployment on resource-constrained devices
  • 🎯 Specialized for egocentric scenarios (first-person perspective)

Model Architecture

  • Base Model: Qwen2-VL-2B-Instruct
  • Teacher Model: Qwen2-VL-7B-Instruct
  • Student Parameters: 2.21B
  • Precision: BFloat16 mixed precision
  • Distillation Method: Logit-based knowledge distillation with KL divergence

Training Details

Training Data

  • Dataset: Synthetic egocentric QA dataset with 5,000 training samples (an illustrative record format is sketched after this list)
  • Validation Set: 1,000 samples
  • Question Types: Object recognition, spatial reasoning, action understanding, temporal queries, environment understanding
  • Scenarios: Kitchen, living room, office, outdoor, workshop
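
The exact record schema is not published with this card. Assuming one image, one question, and one short answer per sample, a record covering the fields above might look like the following; all field names and values here are hypothetical:

# Hypothetical training record; field names are illustrative, not the actual schema.
sample = {
    "image": "frames/kitchen_0042.jpg",        # first-person frame
    "scenario": "kitchen",                     # one of the five scenarios listed above
    "question_type": "object_recognition",     # or spatial, action, temporal, environment
    "question": "What object am I holding in my right hand?",
    "answer": "A wooden spoon."
}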

Training Procedure

  • Framework: PyTorch with Hugging Face Transformers
  • Epochs: 10
  • Batch Size: 2 per GPU with 4× gradient accumulation
  • Learning Rate: 1e-5 (AdamW optimizer)
  • Scheduler: Cosine annealing with 100 warmup steps
  • Loss Function: Weighted combination of distillation loss (α = 0.7) and hard-label cross-entropy loss (1 - α = 0.3); see the loss sketch after this list
  • Temperature: 2.0 for knowledge distillation
  • Hardware: NVIDIA Quadro RTX 8000 (48GB)
  • Training Time: ~4 hours
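
The per-token objective combines a temperature-scaled KL term against the teacher's logits with a standard cross-entropy term against the ground-truth tokens. A minimal sketch of this loss, assuming the α and temperature values above (the project's actual training code may differ in details such as masking and reduction):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.7, temperature=2.0):
    # Soft-target term: KL divergence between temperature-scaled distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Hard-label term: cross-entropy against the ground-truth answer tokens,
    # ignoring prompt/padding positions marked with -100.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )

    return alpha * kd + (1.0 - alpha) * ce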

Training Hyperparameters

{
    "learning_rate": 1e-5,
    "optimizer": "AdamW",
    "weight_decay": 0.01,
    "gradient_accumulation_steps": 4,
    "max_grad_norm": 1.0,
    "warmup_steps": 100,
    "temperature": 2.0,
    "alpha": 0.7,
    "epochs": 10
}
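
A sketch of how these hyperparameters map onto the optimizer, cosine schedule, and 4-step gradient accumulation (illustrative; `model`, `dataloader`, `compute_loss`, and `num_training_steps` are placeholders, not names from the project code):

import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=num_training_steps
)

accum_steps = 4
for step, batch in enumerate(dataloader):
    loss = compute_loss(batch) / accum_steps   # scale so accumulated gradients average out
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()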

Performance Metrics

Inference Speed

Metric                   Teacher (7B)   Student (2B)   Improvement
Avg Latency              1.260 s        0.244 s        5.16×
Throughput (samples/s)   0.79           4.10           5.16×
Throughput (tokens/s)    34.33          162.55         5.16×
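
The benchmarking script is not included in this card. A rough way to reproduce per-sample latency and throughput for greedy decoding (illustrative only, not the exact protocol behind the numbers above):

import time
import torch

def benchmark(model, inputs, runs=20, max_new_tokens=128):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        with torch.no_grad():
            out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / runs            # seconds per sample
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]  # tokens generated per sample
    return {"avg_latency_s": latency,
            "samples_per_s": 1.0 / latency,
            "tokens_per_s": new_tokens / latency}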

Memory Usage

Metric       Teacher (7B)    Student (2B)    Savings
Model Size   8.29B params    2.21B params    73.4%
Peak VRAM    13.81 GB        4.43 GB         9.39 GB (67.9% reduction)
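
Peak VRAM figures of this kind are typically read from the CUDA allocator after a full generation pass; a minimal sketch, assuming a loaded model and prepared inputs on the current GPU:

import torch

torch.cuda.reset_peak_memory_stats()
_ = model.generate(**inputs, max_new_tokens=128)
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_gb:.2f} GB")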

Model Compression

  • Compression Ratio: 3.75× (see the snippet after this list)
  • Parameter Reduction: 73.4%
  • From: 8.29B parameters
  • To: 2.21B parameters
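
The compression ratio and parameter reduction follow directly from the two parameter counts. With both models loaded (here `teacher_model` and `student_model` are placeholder names), they can be reproduced as:

teacher_params = sum(p.numel() for p in teacher_model.parameters())   # ~8.29B
student_params = sum(p.numel() for p in student_model.parameters())   # ~2.21B
print(f"Compression: {teacher_params / student_params:.2f}x, "
      f"{1 - student_params / teacher_params:.1%} fewer parameters")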

Usage

Installation

pip install transformers torch pillow qwen-vl-utils

Inference Example

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model_name = "YOUR_USERNAME/firstsight-qwen2-vl-2b-distilled"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Prepare image and question
image = Image.open("egocentric_image.jpg")
question = "What object am I holding in my right hand?"

# Create conversation template
messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": f"Question: {question}\nAnswer concisely:"}
        ]
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = inputs.to(model.device)

# Generate answer
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=False
    )

# Decode response
response = processor.batch_decode(
    outputs[:, inputs['input_ids'].shape[1]:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(f"Answer: {response}")

Intended Use

Primary Use Cases

  • Egocentric Visual Question Answering: Answer questions about first-person perspective images/videos
  • Edge Device Deployment: Run VLM inference on resource-constrained hardware (mobile, IoT, AR/VR)
  • Real-time Assistive Systems: Power low-latency visual assistants for wearable cameras
  • Smart Glasses Applications: Enable efficient VLM capabilities on AR/VR headsets

Supported Question Types

  1. Object Recognition: "What object did I just pick up?"
  2. Spatial Reasoning: "Where is the nearest door?"
  3. Action Understanding: "What action am I performing?"
  4. Temporal Queries: "What was I looking at 5 seconds ago?"
  5. Environment Understanding: "What room am I in?"
  6. Counting: "How many items are on the table?"
  7. Attribute Recognition: "What color is the object I'm holding?"

Limitations

  • Model is specialized for egocentric scenarios and may perform worse on third-person images
  • Trained on synthetic data; real-world performance may vary
  • No additional multimodal pretraining; the student was adapted solely through knowledge distillation from the teacher
  • May inherit biases from the teacher model (Qwen2-VL-7B)
  • Limited to short-form QA; not optimized for long conversations

Ethical Considerations

  • Privacy: Egocentric images often contain sensitive personal information. Ensure proper consent and data protection.
  • Bias: Model may exhibit biases from training data and teacher model. Evaluate on diverse datasets.
  • Misuse: Could be used for unauthorized surveillance. Deploy responsibly with user consent.

Citation

If you use this model in your research, please cite:

@misc{firstsight2024,
  title={FirstSight: Efficient Knowledge Distillation for Vision-Language Models on Edge Devices},
  author={NYU HPML Project Team},
  year={2024},
  howpublished={\url{https://huggingface.co/YOUR_USERNAME/firstsight-qwen2-vl-2b-distilled}},
  note={Distilled from Qwen2-VL-7B-Instruct for egocentric question answering}
}

Model Card Authors

NYU High Performance Machine Learning (HPML) Project Team

Model Card Contact

For questions or feedback, please open an issue on the GitHub repository.


Training Date: December 8-9, 2024
Evaluation Date: December 9, 2025
Framework: PyTorch 2.3.0, Transformers 4.57.3, BitsAndBytes 0.48.2
