	# OmniShotCut MLX

	Shot Boundary Detection with OmniShotCut, ported to Apple MLX for native Mac inference.

	Based on the paper: [OmniShotCut: Holistic Relational Shot Boundary Detection with Shot-Query Transformer](https://arxiv.org/abs/2604.24762).

	## Features

	- Pure MLX inference — runs natively on Apple Silicon, zero PyTorch dependency at runtime
	- Detects hard cuts, dissolves, fades, wipes, slides, zooms, doorways, and sudden jumps
	- Tunable sensitivity for different video types (action, interview, vlog, film)

	## Requirements

	- macOS with Apple Silicon (M1/M2/M3/M4)
	- Python 3.10+
	- `ffmpeg` (for video I/O)

	```bash
	pip install mlx mlx-metal numpy
	```

	## Quick Start

	```bash
	# 1. Clone the repository (install the dependencies from Requirements first)
	git clone https://github.com/eisneim/OmniShotCut_mlx.git
	cd OmniShotCut_mlx

	# 2. Download weights from HuggingFace
	python omnishotcut_mlx/download_weights.py

	# 3. Run on test videos
	python run_inference.py
	```

	## Download Weights

	```bash
	# Auto-download from HuggingFace Hub (requires huggingface_hub)
	pip install huggingface_hub
	python omnishotcut_mlx/download_weights.py

	# Or manually download from:
	# https://huggingface.co/eisneim/OmniShotCut_mlx
	# Place OmniShotCut.safetensors and config.json into ./weights/

	# Alternative: download without huggingface_hub
	# (curl -o does not create the target directory, so create it first)
	mkdir -p weights
	curl -L -o weights/OmniShotCut.safetensors https://huggingface.co/eisneim/OmniShotCut_mlx/resolve/main/OmniShotCut.safetensors
	curl -L -o weights/config.json https://huggingface.co/eisneim/OmniShotCut_mlx/resolve/main/config.json
	```
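
For scripted setups, the same download can be sketched in Python. This is a minimal sketch, not the contents of `download_weights.py`: the helper names `missing_weights` and `fetch_weights` are hypothetical. `hf_hub_download` is the `huggingface_hub` function for fetching a single file, and the repo ID and filenames are the ones listed above.

```python
from pathlib import Path

REPO_ID = "eisneim/OmniShotCut_mlx"
WEIGHT_FILES = ("OmniShotCut.safetensors", "config.json")


def missing_weights(weights_dir="weights"):
    """Return the expected weight files that are not yet in weights_dir."""
    root = Path(weights_dir)
    return [name for name in WEIGHT_FILES if not (root / name).is_file()]


def fetch_weights(weights_dir="weights"):
    """Download any missing files from the Hub into weights_dir (hypothetical helper)."""
    # Imported lazily so missing_weights() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    Path(weights_dir).mkdir(parents=True, exist_ok=True)
    for name in missing_weights(weights_dir):
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=weights_dir)
```

Calling `fetch_weights()` from the repository root should leave both files in `./weights/`, and `missing_weights()` returning an empty list is a cheap check before running inference.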

	## Usage

	```bash
	# Default: balanced detection
	python run_inference.py

	# Sensitive mode: more cuts, good for action/vlog videos
	python run_inference.py --sensitive

	# Conservative mode: fewer false positives, good for interviews/long takes
	python run_inference.py --conservative

	# Single video
	python run_inference.py --video /path/to/video.mp4

	# Custom output directory
	python run_inference.py --output ./my_shots

	# Fine-tuned control
	python run_inference.py --context 12 --min-shot 0.8 --conf 0.1
	```

	### Tunable Parameters

	| Parameter | Default | Range | Effect |
	|-----------|---------|-------|--------|
	| `--context` | 10 | 0–20 | Number of overlapping frames between consecutive windows. Higher = fewer missed boundaries, but slower |
	| `--min-shot` | 0.5 | 0.1–5.0 | Minimum shot duration in seconds. Higher = fewer false positives |
	| `--conf` | 0.0 | 0.0–1.0 | Intra-class confidence threshold. E.g. 0.3 = keep only predictions the model is more than 30% confident in |
	| `--sensitive` | n/a | n/a | Shortcut: context=15, min-shot=0.3, conf=0 |
	| `--conservative` | n/a | n/a | Shortcut: context=5, min-shot=1.5, conf=0.15 |
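
The two shortcut flags are named bundles of the three numeric parameters. The sketch below shows one way they could be resolved, using the values from the table above. The `Settings` dataclass and `resolve_settings` function are hypothetical names, and the rule that an explicitly passed value overrides the preset is an assumption, not confirmed behavior of `run_inference.py`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Settings:
    context: int = 10       # overlapping frames between windows
    min_shot: float = 0.5   # minimum shot duration in seconds
    conf: float = 0.0       # intra-class confidence threshold


# Values copied from the parameter table.
PRESETS = {
    "default": Settings(),
    "sensitive": Settings(context=15, min_shot=0.3, conf=0.0),
    "conservative": Settings(context=5, min_shot=1.5, conf=0.15),
}


def resolve_settings(preset="default", context=None, min_shot=None, conf=None):
    """Start from a preset, then apply explicit values (assumed precedence)."""
    candidates = (("context", context), ("min_shot", min_shot), ("conf", conf))
    overrides = {key: value for key, value in candidates if value is not None}
    return replace(PRESETS[preset], **overrides)
```

For example, `resolve_settings("conservative", conf=0.3)` keeps the conservative `context` and `min_shot` values while raising the confidence threshold.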

	### Parameter Guide by Video Type

	| Video Type | Recommended | Why |
	|------------|-------------|-----|
	| Action / Sports | `--sensitive` | Fast cuts, many short shots |
	| Vlog / YouTube | default or `--context 15` | Moderate pace, varied editing |
	| Interview / Podcast | `--conservative` | Long takes, few cuts |
	| Film / Cinema | default | Balanced |
	| Animation | `--sensitive` | Frequent scene changes |
	| Screen Recording | `--conservative` or `--min-shot 2.0` | Mostly static |
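
When batch-processing a mixed library, these recommendations can be encoded as a lookup. The sketch below only builds the argument list and runs nothing. The `RECOMMENDED_FLAGS` mapping, its keys, and `build_command` are hypothetical. Where the table offers two options the first is picked arbitrarily, and it is assumed that `--video`, `--output`, and the preset flags can be combined in one call.

```python
import sys

# Hypothetical keys; flag choices follow the video-type table.
RECOMMENDED_FLAGS = {
    "action": ["--sensitive"],
    "vlog": [],                      # default settings
    "interview": ["--conservative"],
    "film": [],                      # default settings
    "animation": ["--sensitive"],
    "screen_recording": ["--conservative"],
}


def build_command(video_path, video_type, output_dir=None):
    """Return an argv list for run_inference.py with the recommended flags."""
    cmd = [sys.executable, "run_inference.py", "--video", str(video_path)]
    cmd += RECOMMENDED_FLAGS[video_type]
    if output_dir is not None:
        cmd += ["--output", str(output_dir)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.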

	## Project Structure

	```
	OmniShotCut_mlx/
	├── run_inference.py              # Main entry point
	├── omnishotcut_mlx/
	│   ├── model.py                  # OmniShotCut MLX model
	│   ├── transformer.py            # Transformer encoder/decoder
	│   ├── resnet.py                 # ResNet18 backbone
	│   ├── position_encoding.py      # 3D sinusoidal position encoding
	│   ├── load_weights.py           # Weight loader (from safetensors)
	│   └── download_weights.py       # HuggingFace weight downloader
	├── weights/
	│   ├── OmniShotCut.safetensors   # MLX-native weights (~157MB)
	│   └── config.json               # Model configuration
	└── test_data/                    # Place test videos here
	```

	## Output

	Shots are saved as `shot_0000.mp4`, `shot_0001.mp4`, ... under `test_data/output/<video_name>/`.

	Each shot file is a self-contained H.264/AAC MP4 clip with the detected shot boundary transitions removed.
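
To make the naming and encoding concrete, here is a sketch of how a single shot could be cut with `ffmpeg`. It is an illustration, not the code in `run_inference.py`. The function names, the use of the file stem as `<video_name>`, the `(start, end)` times in seconds, and the encoder choices (`libx264`, `aac`) are assumptions consistent with the H.264/AAC output described above.

```python
from pathlib import Path


def shot_path(output_root, video_path, index):
    """Build <output_root>/<video_name>/shot_NNNN.mp4 (stem used as video name)."""
    return Path(output_root) / Path(video_path).stem / f"shot_{index:04d}.mp4"


def ffmpeg_cut_command(video_path, start, end, out_path):
    """Return an ffmpeg argv that re-encodes [start, end) seconds to H.264/AAC."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",        # seek to the shot start (seconds)
        "-i", str(video_path),
        "-t", f"{end - start:.3f}",   # shot duration (seconds)
        "-c:v", "libx264",            # H.264 video
        "-c:a", "aac",                # AAC audio
        str(out_path),
    ]
```

Re-encoding rather than stream-copying is chosen here so that each clip can start on an arbitrary frame instead of the nearest keyframe.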

	## Model

	- Architecture: Shot-Query Transformer (DETR-style), 6 encoder + 6 decoder layers, ResNet18 backbone
	- Input: 100-frame windows at 128×96, ImageNet normalization
	- Output: Shot boundary frame indices + intra-shot relation (dissolve, wipe, fade, ...) + inter-shot relation (hard cut, sudden jump, ...)
	- Weights: Converted from the official PyTorch checkpoint, 363 tensors, float32
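
Because the model consumes fixed 100-frame windows, longer videos have to be split into windows, with `--context` controlling how many frames neighboring windows share. Below is a minimal sketch of one way to compute such windows. The function name `window_ranges` and the stride rule (window size minus overlap) are assumptions for illustration, not the port's confirmed logic.

```python
def window_ranges(num_frames, window=100, context=10):
    """Return (start, end) frame ranges, end exclusive, overlapping by `context`."""
    if not 0 <= context < window:
        raise ValueError("context must satisfy 0 <= context < window")
    if num_frames <= 0:
        return []
    stride = window - context  # frames advanced between consecutive windows
    ranges = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            return ranges
        start += stride
```

With the defaults, a 250-frame video yields `(0, 100)`, `(90, 190)`, and `(180, 250)`. The last window is shorter than 100 frames and would need padding before being fed to the model. A larger `context` shrinks the stride, which produces more windows and therefore slower inference.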

	## License & Credits

	Paper: [OmniShotCut (arXiv 2604.24762)](https://arxiv.org/abs/2604.24762) by Boyang Wang et al.

	MLX port by [@eisneim](https://github.com/eisneim). Weights hosted at [eisneim/OmniShotCut_mlx](https://huggingface.co/eisneim/OmniShotCut_mlx).