Just a quick observation on how we evaluate our models: the two data vendors mentioned earlier are selling datasets directly to ASR service providers for training. With that in mind, we should be especially cautious about mixing private vendor data into our evaluations.
At the same time, we're missing out on many high-quality, open-source speech datasets simply because their creators don't have sales teams advocating for them. It would be a big step forward to open a channel through which the research community can recommend these datasets directly to the leaderboard.
P.S. I found myself relating a lot to King George in the movie Hoppers recently. That beaver king just believes the best about everyone. I want to bring that same optimism here—hoping that everyone continues to follow the leaderboard rules, avoids training on open-source test sets, and actively cleans any contamination from their training data.
Wei Chu
Researcher
wei@olewave.com