Question #1
What type of dataset was used for training and validating the model in MSST?
Approximately how much audio was in it (number of tracks / total duration)?
On which GPU was it trained and how long did it take?
Where was it trained (Google Colab, Kaggle)?
Can you provide a tutorial on training model for separation (stem separation)?
Hey, I'm going to answer most of your questions.
Dataset: fully synthetic; no real stems were used. The dataset is constructed in three different cases.
Dataset size: ~1755 tracks averaging ~4 minutes each, so roughly 117 hours of audio. The validation set is 30 tracks of 1 minute each (you can compare the SDR of other models here: https://docs.google.com/spreadsheets/d/1uWOC4XtIYHila7OeuX4N13RFBT4ZN9Dh4TfFQbS9R0M/edit?usp=sharing )
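For context on the SDR numbers in that comparison sheet: SDR is the log ratio of target energy to error energy. A minimal sketch of the standard global SDR definition (the sheet may use a different variant, e.g. chunked/median SDR; this is just the basic formula):

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-9) -> float:
    """Global Signal-to-Distortion Ratio in dB: 10*log10(||s||^2 / ||s - s_hat||^2)."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((num + eps) / (den + eps))

# Toy check on synthetic audio: a perfect estimate scores very high,
# a 10%-amplitude-noise estimate lands around 20 dB.
rng = np.random.default_rng(0)
clean = rng.standard_normal(44100)            # 1 second of "stem" at 44.1 kHz
noisy = clean + 0.1 * rng.standard_normal(44100)
print(sdr(clean, clean))   # very high (limited only by eps)
print(sdr(clean, noisy))   # roughly 20 dB
```

Higher SDR means the separated stem is closer to the ground-truth stem, which is why it works as a leaderboard metric across models.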
GPU and training time: RTX 4070 Ti 12GB. Each epoch took around 30-45 minutes depending on the architecture; total training time across all runs and architecture experiments was several hundred hours.
Where: Local Windows machine, no cloud.
I can't provide a step-by-step tutorial right now, but the full training pipeline uses MSST WebUI by ZFTurbo.
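Not a tutorial, but for anyone wanting to dig in: the underlying trainer is ZFTurbo's Music-Source-Separation-Training repo, which is driven by `train.py` plus a YAML config. A rough invocation from memory of that repo's README (the exact flags, model types, and paths below are assumptions; verify against the current README before running):

```shell
# Hypothetical example; check the repo's README for the real flag names and configs.
python train.py \
    --model_type mel_band_roformer \
    --config_path configs/your_config.yaml \
    --results_path results/ \
    --data_path datasets/your_train_set \
    --valid_path datasets/your_valid_set \
    --device_ids 0
```

The config YAML is where the architecture, stems, and audio/training hyperparameters live, so most of the work is preparing the dataset folders and editing that file rather than the command line.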
