Question #1
What type of dataset was used for training and validating the model in MSST?
Approximately how much audio was in it (number of tracks / total duration)?
On which GPU was it trained and how long did it take?
Where was it trained (Google Colab, Kaggle)?
Can you provide a tutorial on training model for separation (stem separation)?
Hey, I'm going to answer most of your questions.
Dataset: fully synthetic; no real stems were used. The dataset is constructed in three different cases.
Dataset size: ~1755 tracks averaging ~4 minutes each, so roughly 117 hours of audio. The validation set is 30 tracks of 1 minute each (you can compare the SDR of other models here: https://docs.google.com/spreadsheets/d/1uWOC4XtIYHila7OeuX4N13RFBT4ZN9Dh4TfFQbS9R0M/edit?usp=sharing )
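For context on the SDR numbers in that comparison sheet: SDR is the log ratio of target energy to error energy. A minimal sketch of the standard global SDR definition (the sheet may use a different variant, e.g. chunked/median SDR; this is just the basic formula):

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-9) -> float:
    """Global Signal-to-Distortion Ratio in dB: 10*log10(||s||^2 / ||s - s_hat||^2)."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((num + eps) / (den + eps))

# Toy check on synthetic audio: a perfect estimate scores very high,
# a 10%-amplitude-noise estimate lands around 20 dB.
rng = np.random.default_rng(0)
clean = rng.standard_normal(44100)            # 1 second of "stem" at 44.1 kHz
noisy = clean + 0.1 * rng.standard_normal(44100)
print(sdr(clean, clean))   # very high (limited only by eps)
print(sdr(clean, noisy))   # roughly 20 dB
```

Higher SDR means the separated stem is closer to the ground-truth stem, which is why it works as a leaderboard metric across models.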
GPU and training time: RTX 4070 Ti 12GB. Each epoch took around 30-45 minutes depending on the architecture; total training time across all runs and architecture experiments was several hundred hours.
Where: Local Windows machine, no cloud.
I can't provide a step-by-step tutorial right now, but the full training pipeline uses MSST WebUI by ZFTurbo.
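Not a tutorial, but for anyone wanting to dig in: the underlying trainer is ZFTurbo's Music-Source-Separation-Training repo, which is driven by `train.py` plus a YAML config. A rough invocation from memory of that repo's README (the exact flags, model types, and paths below are assumptions; verify against the current README before running):

```shell
# Hypothetical example; check the repo's README for the real flag names and configs.
python train.py \
    --model_type mel_band_roformer \
    --config_path configs/your_config.yaml \
    --results_path results/ \
    --data_path datasets/your_train_set \
    --valid_path datasets/your_valid_set \
    --device_ids 0
```

The config YAML is where the architecture, stems, and audio/training hyperparameters live, so most of the work is preparing the dataset folders and editing that file rather than the command line.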
