| --- |
| language: |
| - multilingual |
| tags: |
| - audio |
| - text |
| - multimodal |
| - seamless |
| - subtitle-editing-time-prediction |
| library_name: transformers |
| base_model: facebook/hf-seamless-m4t-medium |
| license: cc-by-nc-4.0 |
| --- |
| |
| # videoloc/seamless-basic |
|
|
| ## Model Description |
|
|
| **SeamlessBasic** takes an audio segment and its corresponding subtitle text and predicts the **Time To Edit (TTE)**: the time, in seconds, needed to edit or refine that subtitle segment. |
|
|
| The model is built on top of Meta's SeamlessM4T and fine-tuned on a multimodal dataset containing audio-subtitle pairs with editing time annotations across 5 languages: **English, French, Spanish, Italian, and German**. |
|
|
| ### Key Features |
|
|
| - **Multimodal Processing**: Simultaneously processes audio (16kHz) and text inputs |
| - **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders (frozen for stability) |
| - **TTE Prediction**: Predicts editing time required for subtitle segments |
| - **Direct Output**: Raw time values in seconds for immediate use |
|
|
| ## Model Architecture |
|
|
| The model consists of the following components: |
|
|
| 1. **Audio Processing**: |
| - SeamlessM4T speech encoder (frozen) processes raw audio input |
| - Audio projection layer maps speech encoder output to 1024 dimensions |
| - Mean pooling over sequence length to get fixed-size audio embedding |
|
|
| 2. **Text Processing**: |
| - SeamlessM4T text encoder (frozen) processes tokenized text input |
| - Text projection layer maps text encoder output to 1024 dimensions |
| - Mean pooling over sequence length to get fixed-size text embedding |
|
|
| 3. **Feature Fusion**: |
| - Audio and text embeddings are concatenated (2048 total dimensions) |
| - No additional cross-modal attention or complex fusion mechanisms |
|
|
| 4. **Regression Head**: |
| - Multi-layer perceptron: 2048 → 1024 → 512 → 256 → 1 |
| - ReLU activations and dropout for regularization |
| - Single output for TTE prediction (regression, in seconds) |
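
The four stages above can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's `HFSeamlessBasic` code: the class name `FusionRegressionHead`, the dropout rate, the encoder output width of 1024, and the use of an attention mask during mean pooling are all assumptions, and random tensors stand in for the frozen SeamlessM4T encoder outputs.

```python
import torch
from torch import nn


class FusionRegressionHead(nn.Module):
    """Hypothetical stand-in for the projection, pooling, fusion, and MLP stages."""

    def __init__(self, audio_dim=1024, text_dim=1024, hidden=1024, dropout=0.1):
        super().__init__()
        # Projection layers map each encoder's output to 1024 dimensions.
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        # Regression head: 2048 -> 1024 -> 512 -> 256 -> 1 with ReLU and dropout.
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, 1024), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 1),
        )

    @staticmethod
    def masked_mean(states, mask):
        # Mean over the sequence axis, ignoring padded positions (assumed behavior).
        mask = mask.unsqueeze(-1).to(states.dtype)
        return (states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    def forward(self, audio_states, audio_mask, text_states, text_mask):
        audio_emb = self.masked_mean(self.audio_proj(audio_states), audio_mask)
        text_emb = self.masked_mean(self.text_proj(text_states), text_mask)
        fused = torch.cat([audio_emb, text_emb], dim=-1)  # (batch, 2048)
        return self.mlp(fused).squeeze(-1)  # (batch,) TTE in seconds


# Random tensors stand in for frozen encoder outputs: batch of 2 segments.
head = FusionRegressionHead().eval()
audio_states, audio_mask = torch.randn(2, 50, 1024), torch.ones(2, 50)
text_states, text_mask = torch.randn(2, 12, 1024), torch.ones(2, 12)
with torch.no_grad():
    tte = head(audio_states, audio_mask, text_states, text_mask)
print(tte.shape)  # torch.Size([2])
```

Because the two embeddings are only concatenated, neither modality attends to the other; any interaction between audio and text is learned inside the MLP.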
|
|
| ## Quick Start |
|
|
| ### Installation |
| ```bash |
| pip install transformers torch torchaudio huggingface_hub |
| ``` |
|
|
| ### Basic Usage |
| ```python |
| from transformers import AutoModel, AutoConfig |
| from huggingface_hub import hf_hub_download |
| import torch |
| import numpy as np |
| import importlib.util |
| |
| # Load model - custom architecture requires importing the model class |
| model_files = hf_hub_download(repo_id="videoloc/seamless-basic", filename="modeling_seamless_basic.py") |
| spec = importlib.util.spec_from_file_location("modeling_seamless_basic", model_files) |
| modeling_module = importlib.util.module_from_spec(spec) |
| spec.loader.exec_module(modeling_module) |
| |
| # Now load the model using the custom class |
| config = modeling_module.SeamlessBasicConfig.from_pretrained("videoloc/seamless-basic") |
| model = modeling_module.HFSeamlessBasic.from_pretrained("videoloc/seamless-basic") |
| |
| # Load the data collator (included in this repo) |
| collator_file = hf_hub_download(repo_id="videoloc/seamless-basic", filename="data_collator.py") |
| spec = importlib.util.spec_from_file_location("data_collator", collator_file) |
| collator_module = importlib.util.module_from_spec(spec) |
| spec.loader.exec_module(collator_module) |
| |
| # Initialize data collator |
| data_collator = collator_module.DataCollatorSimpleSeamless( |
|     processor="facebook/hf-seamless-m4t-medium", |
|     max_audio_length_sec=8.0, |
|     max_text_length=256, |
| ) |
| |
| # Prepare your data |
| your_data = [ |
|     { |
|         'raw_audio': np.random.randn(16000 * 5),  # 5 seconds at 16kHz |
|         'raw_text': "Your subtitle text here", |
|         # Note: No translation features needed for basic model |
|     } |
| ] |
| |
| # Process and run inference |
| batch = data_collator(your_data) |
| model.eval() |
| with torch.no_grad(): |
|     outputs = model(**batch) |
| tte_prediction = outputs.logits.item() |
| |
| print(f"Predicted Time To Edit: {tte_prediction:.2f} seconds") |
| ``` |
|
|
| ## Model Details |
|
|
| - **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium) |
| - **Audio Encoder**: Frozen SeamlessM4T speech encoder |
| - **Text Encoder**: Frozen SeamlessM4T text encoder |
| - **Hidden Size**: 1024 |
| - **Audio Input**: 16kHz |
| - **Output**: Single regression value (TTE in seconds) |
| - **Task**: Subtitle editing time prediction |
|
|
| ## Data Format |
|
|
| Your input data should be a list of dictionaries with: |
| - `raw_audio`: NumPy array of audio samples (16kHz sampling rate) |
| - `raw_text`: String of subtitle text |
| - `labels`: Target TTE values in seconds (optional, for training) |
|
|
| Example: |
| ```python |
| data = [ |
|     { |
|         'raw_audio': audio_samples,  # shape: (num_samples,) at 16kHz |
|         'raw_text': "Subtitle text content", |
|         'labels': 2.5,  # optional TTE target value in seconds |
|     } |
| ] |
| ``` |
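
A small helper can catch malformed inputs before they reach the data collator. The sketch below is illustrative only: `make_example` is a hypothetical name, and the `float32` cast and the specific checks are assumptions rather than requirements stated by this repository.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz, the rate this card specifies


def make_example(audio, text, tte_seconds=None):
    """Build one input dict in the format described above (hypothetical helper)."""
    audio = np.asarray(audio, dtype=np.float32)  # float32 is an assumed convention
    if audio.ndim != 1:
        raise ValueError(f"expected mono audio of shape (num_samples,), got {audio.shape}")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("raw_text must be a non-empty string")
    example = {"raw_audio": audio, "raw_text": text}
    if tte_seconds is not None:
        example["labels"] = float(tte_seconds)  # only needed for training
    return example


# Three seconds of silence with a training label attached.
example = make_example(np.zeros(SAMPLE_RATE * 3), "Hello there.", tte_seconds=2.5)
print(example["raw_audio"].shape, example["raw_audio"].dtype)  # (48000,) float32
```

Omitting `tte_seconds` yields an inference-only example without a `labels` key.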
|
|
| ## Performance Metrics |
|
|
| - **Best Eval RMSE**: 33.34 (in seconds, since targets are raw, unnormalized TTE values) |
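
For reference, RMSE is the square root of the mean squared difference between predicted and annotated TTE. A minimal sketch, assuming predictions and targets are plain lists of seconds (the function name `rmse` and the sample values are illustrative, not taken from the evaluation code):

```python
import math


def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences of seconds."""
    if len(predictions) != len(targets) or len(targets) == 0:
        raise ValueError("predictions and targets must be non-empty and the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 3, -4, 0, and 0 seconds give sqrt(25 / 4) = 2.5.
print(rmse([3.0, 10.0, 40.0, 7.0], [0.0, 14.0, 40.0, 7.0]))  # 2.5
```

Because errors are squared before averaging, a few segments with very long editing times can dominate the score.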
|
|
| ## Training Details |
|
|
| - **Base Model**: facebook/hf-seamless-m4t-medium |
| - **Epochs**: 10 |
| - **Batch Size (Train)**: 32 |
| - **Batch Size (Eval)**: 64 |
| - **Learning Rate**: 1.2e-4 |
| - **LR Scheduler**: cosine_with_restarts |
| - **Warmup Ratio**: 0.05 |
| - **Weight Decay**: 0.001 |
| - **Optimizer**: AdamW (torch) |
| - **Max Grad Norm**: 1.0 |
| - **FP16**: True |
| - **Early Stopping Patience**: 5 |
| - **Audio Max Length**: 8.0 seconds |
| - **Text Max Length**: 256 tokens |
| - **Sample Rate**: 16kHz |
| - **Normalization**: None (raw values) |
| - **Dataset Split**: 80/20 train/test |
| - **Random Seed**: 42 |
| - **Metric**: RMSE (lower is better) |
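
The learning-rate settings above combine a linear warmup with a cosine decay that can restart. The sketch below reproduces the shape of that schedule in plain Python, assuming it matches the hard-restart cosine schedule in Transformers. The number of cycles is not stated in this card, so `num_cycles` is an assumed parameter, and rounding the warmup step count is a simplification.

```python
import math


def lr_at_step(step, total_steps, base_lr=1.2e-4, warmup_ratio=0.05, num_cycles=1):
    """Learning rate at a given optimizer step (illustrative sketch)."""
    warmup_steps = round(total_steps * warmup_ratio)  # simplified rounding
    if step < warmup_steps:
        # Linear warmup from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Cosine decay that jumps back to base_lr at the start of each cycle.
    cycle_position = (num_cycles * progress) % 1.0
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))


# With 1000 total steps, warmup covers the first 50 steps.
for step in (0, 25, 50, 525, 1000):
    print(step, f"{lr_at_step(step, 1000):.2e}")
```

With `num_cycles=1` the schedule never actually restarts; larger values reset the rate to its peak at the start of each cycle.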
|
|
| ## Training Configuration |
|
|
| The model was trained with the following specifications: |
|
|
| - **Dataset**: Multimodal audio-subtitle pairs with TTE annotations (5 languages: EN, FR, ES, IT, DE) |
| - **Train/Test Split**: 80/20 with random seed 42 |
| - **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset |
| - **Text Processing**: Max 256 tokens |
| - **Normalization**: None (raw TTE values in seconds) |
| - **Caching**: Audio segments cached and compressed for efficiency |
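
A seeded 80/20 split can be reproduced along the following lines. This is a sketch under assumptions: the card does not say which splitting utility was used, so the shuffle-based helper `train_test_split_indices` is hypothetical and will not reproduce the exact train and test membership of the original run.

```python
import random


def train_test_split_indices(num_items, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) from a seeded shuffle (hypothetical helper)."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)  # local RNG keeps the global state untouched
    num_test = round(num_items * test_fraction)
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = train_test_split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```

Fixing the seed makes the partition repeatable across runs, which keeps evaluation scores comparable between experiments.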
|
|
| ## Usage Notes |
|
|
| - This is the **basic** variant - processes only audio and text |
| - For translation-aware models, see `seamless-translation` and `seamless-langpairs` |
| - Model expects 16kHz audio input (automatically resampled by data collator) |
| - Text is processed with SeamlessM4T text encoder |
| - No feature normalization applied - outputs raw TTE predictions in seconds |
| - Optimized for subtitle editing time estimation tasks |
|
|
| ## Limitations |
|
|
| - Designed for TTE prediction, not general audio-text matching |
| - Performance may vary on out-of-domain content or different editing workflows |
| - Requires specific data preprocessing (use included data collator) |
|
|
| ## Related Models |
|
|
| - **[seamless-translation](https://huggingface.co/videoloc/seamless-translation)**: Adds translation awareness features |
| - **[seamless-langpairs](https://huggingface.co/videoloc/seamless-langpairs)**: Includes language pair embeddings for multilingual scenarios |
| - **[seamless-crossattention](https://huggingface.co/videoloc/seamless-crossattention)**: Adds cross-modal attention between the audio and text representations |