# Qwen2.5-Omni Inference Endpoint
This repository contains code for deploying the Qwen2.5-Omni-0.5B model to Hugging Face Inference Endpoints for use with the Indoor Scenes dataset.
## Overview
This implementation, built on LLaVA-OneVision with Qwen2.5-Omni, provides multimodal capabilities for:
- Image captioning
- Audio recognition
- Video understanding
- Test-time scaling
## Deployment Instructions
Set up your Hugging Face account:
- Ensure you have a Hugging Face account with a valid API token
- Use `huggingface-cli login` to authenticate
Create and push to a Hugging Face repository:
```shell
huggingface-cli repo create YOUR_USERNAME/my-qwen-omni-endpoint --type model
git init
git add .
git commit -m "Initial commit"
git remote add origin https://huggingface.co/YOUR_USERNAME/my-qwen-omni-endpoint
git push -u origin main
```

Deploy to Inference Endpoints:
- Go to your repository on Hugging Face
- Navigate to "Settings" > "Inference Endpoints"
- Create a new endpoint
- Select appropriate hardware (a GPU with at least 16 GB of memory is recommended)
- Deploy!
## Using the Endpoint
Text-only example:

```json
{
  "conversation": [
    {"role": "user", "content": "Tell me about yourself."}
  ]
}
```
Image example:

```json
{
  "conversation": [
    {
      "role": "user",
      "content": "What do you see in this image?",
      "images": ["https://example.com/image.jpg"]
    }
  ]
}
```
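The payloads above can be sent to the deployed endpoint as ordinary JSON POST requests. The sketch below uses only the Python standard library; the endpoint URL, token placeholder, and the `build_payload`/`query` helper names are assumptions for illustration, not part of the repository's code.

```python
import json
import os
import urllib.request

# Hypothetical values; substitute the URL shown on your endpoint's page
# and a real Hugging Face API token.
ENDPOINT_URL = os.environ.get("HF_ENDPOINT_URL", "https://YOUR-ENDPOINT.endpoints.huggingface.cloud")
HF_TOKEN = os.environ.get("HF_TOKEN", "hf_xxx")

def build_payload(text, images=None):
    """Build a request body in the conversation format shown above."""
    message = {"role": "user", "content": text}
    if images:
        message["images"] = list(images)
    return {"conversation": [message]}

def query(payload):
    """POST the payload to the endpoint and return the decoded JSON response."""
    request = urllib.request.Request(
        ENDPOINT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {HF_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Example (requires a live endpoint):
# query(build_payload("What do you see in this image?", ["https://example.com/image.jpg"]))
```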
## For MIT Indoor Scenes Dataset
This endpoint is specifically designed to work with the MIT Indoor Scenes dataset (CVPR 2009). The model can generate captions for indoor scene images so that captioning performance can be evaluated.
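For batch captioning, one request body per image can be generated in the conversation format shown earlier. This is a minimal sketch: the prompt text, the image URLs, and the `caption_requests` helper are hypothetical, and it assumes the dataset images are reachable by URL.

```python
# Hypothetical batch-captioning sketch for indoor scene images.
CAPTION_PROMPT = "Describe this indoor scene in one sentence."

def caption_requests(image_urls, prompt=CAPTION_PROMPT):
    """Yield one request body per image, in the endpoint's conversation format."""
    for url in image_urls:
        yield {
            "conversation": [
                {"role": "user", "content": prompt, "images": [url]}
            ]
        }

# Example with two placeholder URLs standing in for dataset images:
urls = [
    "https://example.com/indoor/kitchen_001.jpg",
    "https://example.com/indoor/bedroom_042.jpg",
]
payloads = list(caption_requests(urls))
```

Each payload can then be POSTed to the endpoint and the returned captions collected for evaluation.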
## Testing Test-Time Scaling
The implementation supports test-time scaling through the standard inference interface, allowing for:
- Budget scaling/forcing
- Beam search integration
- Various performance metrics
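One plausible way to exercise these knobs is to attach generation parameters to the request. The `parameters` field and its keys below (`max_new_tokens`, `num_beams`) are assumptions modeled on common Hugging Face generation arguments; check the repository's handler code for the names it actually accepts.

```python
# Hypothetical request builder showing how test-time scaling knobs might be
# passed alongside the conversation. Field names are assumptions, not a
# documented contract of this endpoint.
def scaled_request(text, token_budget, num_beams=1):
    """Attach a generation budget and beam width to a conversation request."""
    return {
        "conversation": [{"role": "user", "content": text}],
        "parameters": {
            "max_new_tokens": token_budget,  # budget scaling/forcing: cap output length
            "num_beams": num_beams,          # beam search integration
        },
    }

req = scaled_request("Summarize this scene.", token_budget=128, num_beams=4)
```

Sweeping `token_budget` (and comparing the resulting captions) is one way to measure how quality scales with the inference-time compute budget.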