Maximum duration in one input?

#3
by coder543 - opened

The example on the card does not do any kind of explicit batching. When I run the model against a 20s sample, I get coherent output. When I run that code against a ~20 minute sample, I get nothing resembling accurate output.

Are we supposed to manually batch the audio? What is the recommended duration?

IBM Granite org

Hi @coder543 ,
The model was trained with a maximal duration of 2 minutes (120 seconds) - can you try chunking into ~2min segments?
Thanks!
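For anyone following along, the chunking could look roughly like this. This is just a minimal sketch, not from the model card: it assumes a 16 kHz mono waveform held as a plain Python list, and the 30 s chunk / 2 s overlap values are illustrative placeholders (the overlap gives the stitching step shared context at each boundary):

```python
def chunk_audio(waveform, sample_rate, chunk_seconds=30.0, overlap_seconds=2.0):
    """Split a mono waveform into fixed-length chunks with a small overlap.

    chunk_seconds / overlap_seconds are illustrative defaults, not
    values recommended by the model card.
    """
    chunk_len = int(chunk_seconds * sample_rate)
    hop = chunk_len - int(overlap_seconds * sample_rate)
    chunks = []
    for start in range(0, len(waveform), hop):
        chunks.append(waveform[start:start + chunk_len])
        if start + chunk_len >= len(waveform):
            break  # last chunk reached the end of the audio
    return chunks

# Example: 65 seconds of silence at 16 kHz -> three chunks (last one shorter)
audio = [0.0] * (65 * 16000)
pieces = chunk_audio(audio, 16000)
print(len(pieces))  # 3
```

Each chunk would then be fed to the model separately, and the per-chunk transcripts merged afterwards.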

BTW @coder543 I wouldn't recommend larger batches, nor even the full 120 seconds. Here are some benchmark results. One thing to keep in mind, however, is that shorter chunk durations require more "stitching" at the chunk boundaries to keep the text cohesive... but as long as you have a reasonably sophisticated way of doing this (feasible), there's little to no benefit to chunk sizes larger than 30 seconds IMHO. Unless @Avihu could perhaps tell us whether using 30 seconds might degrade quality for being too short, since, after all, the model was trained on up-to-120-second chunks... basically the opposite problem of having chunks that are too long?

[three benchmark result images]

Sorry, forgot to mention that this was done on an RTX 4090 processing the ~2 hr Sam Altman interview that's open source... so this model is very fast. Not quite as fast as Nvidia's Parakeet-TDT, but if the quality claims are true, it's arguably the best model when weighing quality against processing time. Haven't tested on CPU yet, though... don't know if it's even possible, but Parakeet-TDT does run fairly well on CPU. Then again, Parakeet-TDT requires the large nemo-toolkit dependency and isn't Transformers 5+ compatible yet, so I've been running it with patched source code... whereas this model "just works." Tested it with Transformers 4.56.7 and 5+.

IBM Granite org

> @Avihu could perhaps tell us if using 30 seconds might cause a degradation in quality for being too short since, after all, the model was trained on 120 second chunks... basically the opposite of having too long of chunks?

@ctranslate2-4you It was trained on at most 120 seconds; any training utterance with a duration outside 0.1–120 seconds was excluded from training.

Longer utterances provide more context, which might help improve accuracy (and there's less stitching), but attention requires more computation with longer context.
And thanks!
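On the stitching side, if adjacent chunks are transcribed with a bit of audio overlap, merging the transcripts can be as simple as dropping the repeated words at the boundary. A naive word-level sketch (not anything from the model card — a real pipeline would likely also normalize punctuation/casing before matching):

```python
def stitch(left, right, max_overlap_words=10):
    """Merge two chunk transcripts by dropping the longest word sequence
    that both ends `left` and starts `right` (the shared overlap)."""
    lw, rw = left.split(), right.split()
    best = 0
    for k in range(1, min(max_overlap_words, len(lw), len(rw)) + 1):
        if lw[-k:] == rw[:k]:
            best = k  # found a longer shared boundary sequence
    return " ".join(lw + rw[best:])

print(stitch("the quick brown fox", "brown fox jumps over"))
# -> the quick brown fox jumps over
```

With no shared words, it just concatenates, so it degrades gracefully when the overlap region was silent.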

> And thanks!

@Avihu
Yep, and super excited to test out the accuracy some more (with good stitching) as well as work it into some of my programs. I'm uncertain how it'll perform on CPU but we'll see!

Anyhow, I for one appreciate IBM taking the time to release "reasonably focused" "dense" models, which are almost becoming niche. More and more I'm seeing absurd 1 trillion or 600+ billion parameter models on Hugging Face... basically things neither I nor everyday consumers will ever be able to run. I just want a high quality model that excels at RAG... or a high quality model that excels at vision tasks... for example.

I don't need a mixture-of-experts model with 17 quadrillion "experts" with only two active at a time (obvious exaggeration)... I don't need a single model that does vision, transcribes audio, performs text-to-speech, detects bounding boxes, and folds my laundry... SOMETIMES it's nice to have a straightforward flipping "dense" model that just does a subset of tasks really well. So I'd like to give IBM credit for that. Mistral used to do that, but not so much anymore...
