Community STT Comparison

Hey folks, I have some scripts put together that launch audio files at several different speech-to-text engines and does a comparison. I’d love to gather some community samples, with all kinds of different accents and text complexity, and share that text comparison with the community. If you’re interested, please reach out and share a recording of yourself. Thank you so much!


Very interesting.

For people like me without too much imagination, perhaps it would help to list a few sample transcriptions that we could read.


That is indeed the intention. For my individual use case, I just ran a diff command, but it wouldn’t be too much work to visualize that diff in a friendly webpage. It was surprising to see just how much better the winner among the STT engines was!
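For anyone curious what that kind of comparison looks like, here's a minimal sketch (not the actual script) that diffs a reference transcript against each engine's output and derives a rough word error rate. The engine names and transcripts are made up for illustration:

```python
# Sketch: score STT hypotheses against a known reference transcript.
import difflib

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Rough WER: count non-matching words via difflib's opcodes."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp)
    errors = 0
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            # Count the larger side of a replace/insert/delete as errors.
            errors += max(i2 - i1, j2 - j1)
    return errors / max(len(ref), 1)

# Hypothetical engines and outputs, just to show the shape of the comparison.
reference = "the quick brown fox jumps over the lazy dog"
hypotheses = {
    "engine_a": "the quick brown fox jumps over the lazy dog",
    "engine_b": "the quick brown fox jumped over a lazy dog",
}
for name, hyp in hypotheses.items():
    print(f"{name}: WER = {word_error_rate(reference, hyp):.2f}")
```

A friendly webpage could render the same `get_opcodes()` output as colored inline diffs instead of a single score.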


You may want to contact @KathyReid about this as well, since she has been involved (and may still be) in work on underrepresented languages in computing.


hey thanks for the shout-out @baconator, and hope you’re doing well. Yeah, this is definitely my wheelhouse. FWIW, Mozilla’s CV v15 dataset has about 1 million rows of English speech that has accent metadata, if that’s useful.
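If you wanted to pull just the accent-annotated subset out of a Common Voice release, it could look something like the sketch below. The column names (`path`, `sentence`, `accents`) are assumptions based on recent CV metadata files; check the header of the TSV in the release you actually download:

```python
# Sketch: keep only Common Voice rows that carry accent metadata.
import csv
import io

def rows_with_accent(tsv_text: str, accent_col: str = "accents"):
    """Return rows from a CV-style TSV whose accent column is non-empty."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row.get(accent_col, "").strip()]

# Tiny inline sample standing in for validated.tsv.
sample = (
    "path\tsentence\taccents\n"
    "a.mp3\thello there\tUnited States English\n"
    "b.mp3\tgood morning\t\n"
)
print(len(rows_with_accent(sample)))  # only the annotated row survives
```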


I’ve also stopped sharing recordings of my voice because of the massive reduction in how much voice data is required to do voice synthesis / voice cloning - models like VALL-E require as little as 3 seconds of data:


Wow, thank you Kathy! I was planning to stay away from Common Voice since I would assume it’s part of the training data for models like Whisper and NeMo, but this isn’t my area of specialization. Would you say these samples would still be valid for a comparison like this?


As a time-sensitive workaround: if you know when the training data set was pulled for Whisper or NeMo, and Common Voice has published a release since then, you could choose samples submitted after that date.

Nice to see you here again @KathyReid! Thanks for contributing :slight_smile:


Hey @mikejgray, I can confirm that Common Voice was not used as part of the training set for Whisper, per their paper (it was used as an evaluation set):

  • Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023, July). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (pp. 28492-28518). PMLR.

I cannot speak to NeMo as I interned with the NVIDIA NeMo team in 2021-2022 and am under NDA.

@NeonClary’s suggestion is a good one, and the v16 Delta of CV may be a good test set, although I’d want to know more about the accent / metadata composition of the delta (the analysis I’m working on uses up to CV 15 only) prior to using it as a test set.
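Common Voice publishes the delta segments directly, but the same idea can be sketched as a set difference between two releases' clip lists; the file names below are hypothetical:

```python
# Sketch: isolate clips present in a newer CV release but not an older one,
# i.e. a home-grown stand-in for the published delta segment.
def new_clips(old_paths, new_paths):
    """Return clip paths that appear only in the newer release, sorted."""
    return sorted(set(new_paths) - set(old_paths))

# Hypothetical clip lists from two consecutive releases.
v15 = ["clip_001.mp3", "clip_002.mp3"]
v16 = ["clip_001.mp3", "clip_002.mp3", "clip_003.mp3"]
print(new_clips(v15, v16))  # → ['clip_003.mp3']
```

The accent composition of that delta would still need checking before treating it as a balanced test set, as noted above.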

See also: