Huge multilanguage KW dataset

StuartIanNaylor · March 1, 2022, 5:49am

New to me so thought I would post.

Multilingual Spoken Words Corpus is a large and growing audio dataset of spoken words in 50 languages for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours).

I still have my usual sources: http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz, and Common Voice has the Single Word Target Segment.

I used to use sox to sort the clips of one word, 'hey', by pitch into a CSV, then do the same with a second word, 'marvin', and then use a little script to concatenate the two into 1-second KWs, which worked a treat.
The more phones you use and the more unique they are, the better. I still need to download the English model alone to find out which words are included.
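
For anyone wanting to try the pairing trick, here is a rough sketch of the idea in Python rather than my original sox script. It assumes the clips are already loaded as mono NumPy arrays at 16 kHz with the silence trimmed off, and that you already have a (filename, pitch) list for each word. The names `pair_by_pitch` and `join_to_one_second` are made up for illustration.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumption: 16 kHz mono clips


def pair_by_pitch(first_rows, second_rows):
    """Pair clips of two words by pitch rank.

    Each row is a (filename, pitch_hz) tuple. Both lists are sorted by
    pitch and zipped, so low voices pair with low voices.
    """
    a = sorted(first_rows, key=lambda row: row[1])
    b = sorted(second_rows, key=lambda row: row[1])
    return [(x[0], y[0]) for x, y in zip(a, b)]


def join_to_one_second(first, second, sample_rate=SAMPLE_RATE):
    """Concatenate two clips, then centre-pad or centre-trim to 1 second."""
    joined = np.concatenate([first, second])
    target = sample_rate  # number of samples in one second
    if len(joined) >= target:
        start = (len(joined) - target) // 2
        return joined[start:start + target]
    pad = target - len(joined)
    left = pad // 2
    # np.pad defaults to constant zero padding, i.e. digital silence
    return np.pad(joined, (left, pad - left))
```

If the two trimmed words together run longer than a second, the centre trim will clip the edges, so it is worth checking clip lengths before joining.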

PS: I did notice Precise has been quantised to TF-Lite, and it runs great on 64-bit TF-Lite, where NEON makes a considerable speed improvement.
From what I saw, though, the Mycroft dataset still splits into just two labels, KW and !KW. My take is that !KW is so general in what people might put in it that it becomes extremely underfitted, with so much variance that the model can often accept almost anything.
Just add one more label, Noise, and bung your noise files into that label, but make !KW just spoken words that are not your KW, and I will guarantee you will get more accuracy.
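
To make the three-label split concrete, here is a minimal sketch of how the samples could be sorted. This is not how Precise or the Mycroft tooling organises data; the label names and `build_index` are my own placeholders, and it assumes each sample comes as a (path, word) pair where the word is `None` for non-speech noise.

```python
LABELS = ("kw", "not_kw", "noise")  # hypothetical label names


def build_index(samples, keyword):
    """Group sample paths under three labels instead of two.

    samples: iterable of (path, word) tuples; word is None for noise files.
    keyword: the spoken word (or joined phrase) that counts as the KW.
    """
    index = {label: [] for label in LABELS}
    for path, word in samples:
        if word is None:
            index["noise"].append(path)    # non-speech goes to its own label
        elif word == keyword:
            index["kw"].append(path)
        else:
            index["not_kw"].append(path)   # spoken words that are not the KW
    return index
```

The point is only that !KW stays speech-only, so that class has far less variance to cover.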

Just saying

You might already be aware of Multilingual Spoken Words Dataset | MLCommons Datasets, but it was new to me, and it's always bugged me to have a ton of ASR datasets whilst KW was such a struggle.