
Qwen2.5-Omni-3B for Indoor Scenes Test-Time Scaling

This repository contains the code for deploying Qwen2.5-Omni-3B as a Hugging Face Inference Endpoint, optimized for test-time scaling experiments on the MIT Indoor Scenes dataset (CVPR 2009).

Overview

Built on Qwen2.5-Omni-3B, this LLaVA-OneVision-style implementation provides multimodal capabilities for:

  • Image captioning and understanding
  • Video analysis
  • Audio processing
  • Test-time scaling with budget parameters

This endpoint is specifically designed for research on test-time scaling techniques using the MIT Indoor Scenes dataset from CVPR 2009.

Features

  • Multimodal Input Support: Process images, videos, and audio
  • Test-Time Scaling: Implement budget scaling/forcing for controlled generation
  • Beam Search Integration: Configurable beam search parameters
  • Custom Performance Metrics: Specialized for indoor scene captioning tasks

Usage Examples

Basic Image Captioning

{
  "conversation": [
    {
      "role": "user",
      "content": "Describe this indoor scene in detail.",
      "images": ["https://example.com/indoor_scene.jpg"]
    }
  ]
}
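Assuming the endpoint is already deployed, the payload above can be assembled and sent from Python. The endpoint URL and token below are placeholders for your own deployment, not real values:

```python
import json

# Placeholder values -- substitute your own deployed endpoint URL and HF token.
ENDPOINT_URL = "https://your-endpoint.endpoints.huggingface.cloud"
HF_TOKEN = "hf_your_token_here"

def build_caption_payload(prompt: str, image_url: str) -> dict:
    """Build the request body shown above for a single-image captioning turn."""
    return {
        "conversation": [
            {
                "role": "user",
                "content": prompt,
                "images": [image_url],
            }
        ]
    }

payload = build_caption_payload(
    "Describe this indoor scene in detail.",
    "https://example.com/indoor_scene.jpg",
)

# To actually send the request (requires the `requests` package):
# response = requests.post(
#     ENDPOINT_URL,
#     headers={"Authorization": f"Bearer {HF_TOKEN}"},
#     json=payload,
# )
body = json.dumps(payload)
```

The helper only mirrors the JSON structure documented above; any server-side validation beyond that is the endpoint's concern.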

With Test-Time Scaling Parameters

{
  "conversation": [
    {
      "role": "user",
      "content": "Describe this indoor scene in detail.",
      "images": ["https://example.com/indoor_scene.jpg"]
    }
  ],
  "test_time_settings": {
    "budget_scale": 1.2,
    "num_beams": 3
  }
}
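One plausible way a handler could translate these settings into transformers generation arguments is to scale a base token budget by budget_scale and pass num_beams through to beam search. This is a sketch of that mapping under assumed defaults, not the endpoint's actual handler code:

```python
def generation_kwargs(test_time_settings: dict, base_max_new_tokens: int = 128) -> dict:
    """Map the request's test_time_settings to generation parameters.

    budget_scale multiplies an assumed base token budget; num_beams is
    passed straight through. The defaults here are illustrative
    assumptions, not the endpoint's documented behavior.
    """
    budget_scale = test_time_settings.get("budget_scale", 1.0)
    return {
        "max_new_tokens": int(base_max_new_tokens * budget_scale),
        "num_beams": test_time_settings.get("num_beams", 1),
    }

kwargs = generation_kwargs({"budget_scale": 1.2, "num_beams": 3})
```

The resulting dict can be splatted into a transformers `model.generate(**kwargs)` call.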

Deployment Instructions

  1. Create a Hugging Face Repository:

    huggingface-cli login
    huggingface-cli repo create your-username/qwen-omni-indoor-endpoint --type model
    
  2. Initialize and Push:

    cd qwen-omni-endpoint-fresh
    git init
    git add .
    git commit -m "Initial commit"
    git remote add origin https://huggingface.co/your-username/qwen-omni-indoor-endpoint
    git push -u origin main
    
  3. Deploy on Hugging Face:

    • Navigate to your repository on Hugging Face
    • Go to the "Deploy" tab
    • Select "Inference Endpoints"
    • Choose appropriate hardware (a GPU with at least 16GB of VRAM is recommended for the 3B model)
    • Deploy!

Implementation Details

The endpoint implements test-time scaling for LLaVA-Onevision with the following components:

  1. Budget Scaling/Forcing: Controls the verbosity and detail level in the generated captions
  2. Beam Search Integration: Improves caption quality through parallel hypothesis exploration
  3. Performance Metrics: Specialized evaluation for indoor scene captioning accuracy

Hardware Requirements

For optimal performance with the 3B model:

  • GPU: NVIDIA T4 or better (16GB+ VRAM)
  • CPU: 4+ cores
  • RAM: 16GB+
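The 16GB VRAM figure follows from a back-of-the-envelope weight count: 3B parameters in fp16/bf16 already occupy roughly 6 GiB before activations, the KV cache, and the vision/audio encoders are accounted for:

```python
params = 3e9          # ~3B parameters
bytes_per_param = 2   # fp16 / bf16
weights_gib = params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gib:.1f} GiB")  # roughly 5.6 GiB
```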
