Qwen2.5-Omni-3B for Indoor Scenes Test-Time Scaling
This repository contains the code for deploying Qwen2.5-Omni-3B as a Hugging Face Inference Endpoint, optimized for a test-time scaling implementation with the MIT Indoor Scenes dataset (CVPR 2009).
Overview
The LLaVA-OneVision implementation built on Qwen2.5-Omni-3B provides powerful multimodal capabilities for:
- Image captioning and understanding
- Video analysis
- Audio processing
- Test-time scaling with budget parameters
This endpoint is specifically designed for research on test-time scaling techniques using the MIT Indoor Scenes dataset (CVPR 2009).
Features
- Multimodal Input Support: Process images, videos, and audio
- Test-Time Scaling: Implement budget scaling/forcing for controlled generation
- Beam Search Integration: Configurable beam search parameters
- Custom Performance Metrics: Specialized for indoor scene captioning tasks
Usage Examples
Basic Image Captioning
{
  "conversation": [
    {
      "role": "user",
      "content": "Describe this indoor scene in detail.",
      "images": ["https://example.com/indoor_scene.jpg"]
    }
  ]
}
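Assuming the endpoint exposes the standard Inference Endpoints HTTP interface, a request like the one above can be built and sent from Python roughly as follows. This is a sketch: the endpoint URL and token are placeholders you must supply after deployment.

```python
import json
import urllib.request

def build_request(prompt, image_urls):
    """Build the conversation payload shown in the README example."""
    return {
        "conversation": [
            {"role": "user", "content": prompt, "images": list(image_urls)}
        ]
    }

def query_endpoint(payload, endpoint_url, hf_token):
    """POST the payload to a deployed Inference Endpoint (sketch)."""
    req = urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {hf_token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_request(
    "Describe this indoor scene in detail.",
    ["https://example.com/indoor_scene.jpg"],
)
```

Calling `query_endpoint(payload, "https://<your-endpoint>.endpoints.huggingface.cloud", "hf_...")` then returns the parsed JSON response.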
With Test-Time Scaling Parameters
{
  "conversation": [
    {
      "role": "user",
      "content": "Describe this indoor scene in detail.",
      "images": ["https://example.com/indoor_scene.jpg"]
    }
  ],
  "test_time_settings": {
    "budget_scale": 1.2,
    "num_beams": 3
  }
}
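On the server side, one minimal way to fold `test_time_settings` into generation parameters is sketched below. The baseline budget of 128 new tokens and the clamping are assumptions for illustration, not values taken from this repository.

```python
def resolve_generation_kwargs(test_time_settings, base_max_new_tokens=128):
    """Translate test-time scaling settings into generate() kwargs (sketch).

    budget_scale stretches or shrinks the token budget for the caption;
    num_beams controls parallel hypothesis exploration in beam search.
    """
    settings = test_time_settings or {}
    budget_scale = float(settings.get("budget_scale", 1.0))
    num_beams = int(settings.get("num_beams", 1))
    # Budget scaling: rescale the caption's token budget, keeping it >= 1.
    max_new_tokens = max(1, int(round(base_max_new_tokens * budget_scale)))
    return {
        "max_new_tokens": max_new_tokens,
        "num_beams": num_beams,
        "early_stopping": num_beams > 1,
    }

kwargs = resolve_generation_kwargs({"budget_scale": 1.2, "num_beams": 3})
```

The resulting dictionary can be passed straight through to a Hugging Face `model.generate(**kwargs, ...)` call.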
Deployment Instructions
Create a Hugging Face Repository:
huggingface-cli login
huggingface-cli repo create your-username/qwen-omni-indoor-endpoint --type model
Initialize and Push:
cd qwen-omni-endpoint-fresh
git init
git add .
git commit -m "Initial commit"
git remote add origin https://huggingface.co/your-username/qwen-omni-indoor-endpoint
git push -u origin main
Deploy on Hugging Face:
- Navigate to your repository on Hugging Face
- Go to the "Deploy" tab
- Select "Inference Endpoints"
- Choose appropriate hardware (recommend at least 16GB GPU for 3B model)
- Deploy!
Implementation Details
The endpoint implements test-time scaling for LLaVA-Onevision with the following components:
- Budget Scaling/Forcing: Controls the verbosity and detail level in the generated captions
- Beam Search Integration: Improves caption quality through parallel hypothesis exploration
- Performance Metrics: Specialized evaluation for indoor scene captioning accuracy
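The repository does not spell out the performance metric itself; one simple specialized check for indoor scene captioning is keyword recall against the objects expected in a scene. The following is a hypothetical sketch, not the metric actually shipped with the endpoint:

```python
def keyword_recall(caption, reference_keywords):
    """Fraction of reference scene keywords mentioned in a caption (sketch)."""
    if not reference_keywords:
        return 0.0
    # Lowercase and strip trailing punctuation so "chairs." matches "chairs".
    words = {w.strip(".,;:!?") for w in caption.lower().split()}
    hits = sum(1 for kw in reference_keywords if kw.lower() in words)
    return hits / len(reference_keywords)

score = keyword_recall(
    "A kitchen with a wooden table, a sink and two chairs.",
    ["kitchen", "table", "sink", "chairs"],
)
```

Here all four reference objects appear in the caption, so `score` is 1.0; a caption missing half the objects would score 0.5.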
Hardware Requirements
For optimal performance with the 3B model:
- GPU: NVIDIA T4 or better (16GB+ VRAM)
- CPU: 4+ cores
- RAM: 16GB+