
Qwen2.5-Omni-3B for Indoor Scenes Test-Time Scaling

This repository contains the code for deploying Qwen2.5-Omni-3B as a Hugging Face Inference Endpoint, optimized for test-time scaling experiments on the MIT Indoor Scenes dataset (CVPR 2009).

Overview

Built on Qwen2.5-Omni-3B, this LLaVA-OneVision-style implementation provides multimodal capabilities for:

  • Image captioning and understanding
  • Video analysis
  • Audio processing
  • Test-time scaling with budget parameters

This endpoint is specifically designed for research on test-time scaling techniques using the MIT Indoor Scenes dataset from CVPR 2009.

Features

  • Multimodal Input Support: Process images, videos, and audio
  • Test-Time Scaling: Implement budget scaling/forcing for controlled generation
  • Beam Search Integration: Configurable beam search parameters
  • Custom Performance Metrics: Specialized for indoor scene captioning tasks

Usage Examples

Basic Image Captioning

{
  "conversation": [
    {
      "role": "user",
      "content": "Describe this indoor scene in detail.",
      "images": ["https://example.com/indoor_scene.jpg"]
    }
  ]
}
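Assuming the endpoint is already deployed, the payload above can be assembled and sent from Python. The endpoint URL and token below are placeholders for your own deployment, not real values:

```python
import json

# Placeholder values -- substitute your own deployed endpoint URL and HF token.
ENDPOINT_URL = "https://your-endpoint.endpoints.huggingface.cloud"
HF_TOKEN = "hf_your_token_here"

def build_caption_payload(prompt: str, image_url: str) -> dict:
    """Build the request body shown above for a single-image captioning turn."""
    return {
        "conversation": [
            {
                "role": "user",
                "content": prompt,
                "images": [image_url],
            }
        ]
    }

payload = build_caption_payload(
    "Describe this indoor scene in detail.",
    "https://example.com/indoor_scene.jpg",
)

# To actually send the request (requires the `requests` package):
# response = requests.post(
#     ENDPOINT_URL,
#     headers={"Authorization": f"Bearer {HF_TOKEN}"},
#     json=payload,
# )
body = json.dumps(payload)
```

The helper only mirrors the JSON structure documented above; any server-side validation beyond that is the endpoint's concern.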

With Test-Time Scaling Parameters

{
  "conversation": [
    {
      "role": "user",
      "content": "Describe this indoor scene in detail.",
      "images": ["https://example.com/indoor_scene.jpg"]
    }
  ],
  "test_time_settings": {
    "budget_scale": 1.2,
    "num_beams": 3
  }
}
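One plausible way a handler could translate these settings into transformers generation arguments is to scale a base token budget by budget_scale and pass num_beams through to beam search. This is a sketch of that mapping under assumed defaults, not the endpoint's actual handler code:

```python
def generation_kwargs(test_time_settings: dict, base_max_new_tokens: int = 128) -> dict:
    """Map the request's test_time_settings to generation parameters.

    budget_scale multiplies an assumed base token budget; num_beams is
    passed straight through. The defaults here are illustrative
    assumptions, not the endpoint's documented behavior.
    """
    budget_scale = test_time_settings.get("budget_scale", 1.0)
    return {
        "max_new_tokens": int(base_max_new_tokens * budget_scale),
        "num_beams": test_time_settings.get("num_beams", 1),
    }

kwargs = generation_kwargs({"budget_scale": 1.2, "num_beams": 3})
```

The resulting dict can be splatted into a transformers `model.generate(**kwargs)` call.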

Deployment Instructions

  1. Create a Hugging Face Repository:

    huggingface-cli login
    huggingface-cli repo create your-username/qwen-omni-indoor-endpoint --type model
    
  2. Initialize and Push:

    cd qwen-omni-endpoint-fresh
    git init
    git add .
    git commit -m "Initial commit"
    git remote add origin https://huggingface.co/your-username/qwen-omni-indoor-endpoint
    git push -u origin main
    
  3. Deploy on Hugging Face:

    • Navigate to your repository on Hugging Face
    • Go to the "Deploy" tab
    • Select "Inference Endpoints"
    • Choose appropriate hardware (a GPU with at least 16GB of VRAM is recommended for the 3B model)
    • Deploy!

Implementation Details

The endpoint implements test-time scaling for LLaVA-Onevision with the following components:

  1. Budget Scaling/Forcing: Controls the verbosity and detail level in the generated captions
  2. Beam Search Integration: Improves caption quality through parallel hypothesis exploration
  3. Performance Metrics: Specialized evaluation for indoor scene captioning accuracy

Hardware Requirements

For optimal performance with the 3B model:

  • GPU: NVIDIA T4 or better (16GB+ VRAM)
  • CPU: 4+ cores
  • RAM: 16GB+
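The 16GB VRAM figure follows from a back-of-the-envelope weight count: 3B parameters in fp16/bf16 already occupy roughly 6 GiB before activations, the KV cache, and the vision/audio encoders are accounted for:

```python
params = 3e9          # ~3B parameters
bytes_per_param = 2   # fp16 / bf16
weights_gib = params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gib:.1f} GiB")  # roughly 5.6 GiB
```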
