Amount of training data for custom wake word

consetto · July 26, 2019, 11:48am

Hello,

I’m currently training my own custom wake word (WW) and I’m facing some issues:

1- Words other than the wake word are being recognized.

At first, I trained the WW model using only the Public Domain Sounds Backup as negative examples (100 files) and 12 files containing the wake word as positive examples. This led the model to recognize almost every noise that contained voice, because the PDSB set is a collection of noises, not voice. Then I added 18 more positive examples and the recognition got slightly better. Then I added 25 negative examples containing words that sound similar to my WW. The model got much better but still recognizes many words that are not the intended WW.
Since the model is slowly getting better, I’m guessing that it is just a matter of more training data. I would like to know if I’m on the right path, or if the training should work much better with less data and I’m just training it wrong.

2- The model only recognizes the WW when it is pronounced very close to the mic.

I tried different mics and the WW ‘hey mycroft’ works really well even from afar. My model only works (if at all) when the person is within 0.5 m of the mic. When I was recording the audio samples, I asked people to say the word about 5 cm from the mic. Could this be the reason? Is there a way to change the sensitivity of the model from the code that trains it? I’ve read the code but I couldn’t find any parameter for that. Maybe I missed it?

It is my first time training an ML model, so any further information about this process is highly appreciated.

baconator · July 26, 2019, 2:36pm

For 1) yes, more data, particularly more not-wake-word data (try for maybe 5:1 nww:ww). For nww samples, using rhyme words and similar sounding words is a great idea.
For 2) record samples in a variety of ways. If all your ww samples are from high-volume, clear recordings, that’s what it’s going to recognize. I used a bunch of different samples. You can also turn on local wake word recording, then sort those samples (good and bad) and use them for training.
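
To put numbers on the 5:1 suggestion, here is a rough check using the counts from the first post (12 + 18 wake word clips, 100 + 25 not-wake-word clips). The helper name `extra_negatives_needed` is made up for this example and is not part of Precise.

```python
import math


def extra_negatives_needed(n_wake, n_not_wake, target_ratio=5.0):
    """Return how many more not-wake-word clips are needed to reach target_ratio:1."""
    needed = math.ceil(n_wake * target_ratio)
    return max(0, needed - n_not_wake)


wake = 12 + 18        # positive clips mentioned in the first post
not_wake = 100 + 25   # negative clips mentioned in the first post

print(f"current ratio: {not_wake / wake:.2f}:1")                # current ratio: 4.17:1
print("more nww clips needed:", extra_negatives_needed(wake, not_wake))  # 25
```

The ratio is only a rule of thumb, so treat the result as a rough target rather than a hard requirement.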

gez-mycroft · July 28, 2019, 11:02pm

Hi Consetto,

Hope the training is going well. In case you hadn’t seen it in our docs, baconator (aka el-tocino) wrote up a number of their learnings that might be helpful:

https://github.com/el-tocino/localcroft/blob/master/precise/Precise.md

The [precise page](https://github.com/MycroftAI/mycroft-precise/wiki/Training-your-own-wake-word#how-to-train-your-own-wake-word) has good instructions to get you started.  

#### data

You need more data. 

You should pick a wake word with at least three syllables (either one word, or two words totaling three or more syllables), preferably one that does not have a lot of similar-sounding rhymes. 

The more data you can collect, the better, up to about 50k samples.  I've collected over 400 total wake word and about 5000 fake word samples (including generated sounds).  If you're using local uploads, you can review those and add them to your dataset.  Once you have collected your data, aim for an 80/20 training/test split, i.e., for 100 clips, 80 go to the wakewords folder and 20 go to the test/wakewords folder.  In a ten-minute span, I can record about 75 prepared words using precise-collect.  
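
A minimal sketch of the 80/20 split described above, assuming all recorded clips are `.wav` files in one folder. The function name `split_clips` and the example paths are hypothetical; the folder names follow the text above, so adjust `label` to whatever layout your Precise version expects.

```python
import random
import shutil
from pathlib import Path


def split_clips(source_dir, dataset_dir, label="wakewords", test_fraction=0.2, seed=0):
    """Copy clips into dataset_dir/<label> (train) and dataset_dir/test/<label> (test).

    Returns a (n_train, n_test) tuple. Files are copied, not moved, so the
    original recordings stay untouched.
    """
    clips = sorted(Path(source_dir).glob("*.wav"))
    random.Random(seed).shuffle(clips)  # fixed seed keeps the split reproducible

    n_test = round(len(clips) * test_fraction)
    train_dir = Path(dataset_dir) / label
    test_dir = Path(dataset_dir) / "test" / label
    train_dir.mkdir(parents=True, exist_ok=True)
    test_dir.mkdir(parents=True, exist_ok=True)

    for i, clip in enumerate(clips):
        target = test_dir if i < n_test else train_dir
        shutil.copy2(clip, target / clip.name)

    return len(clips) - n_test, n_test


# Example call (paths are placeholders):
# n_train, n_test = split_clips("recordings/raw", "my-wake-word")
```

With 100 clips and the default `test_fraction`, this puts 80 in the training folder and 20 in the test folder. The same function can be reused for the not-wake-word clips by passing a different `label`.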

Slight update: The Google Speech Commands dataset v0.2 can also be pulled down and used to supplement your not-wake-words.  This contains almost 100k samples.  When adding this to my current data, training slows down quite a bit, but accuracy and val_acc both improve. I'm seeing val_acc reaching .999+ routinely now.  

Having a base of clean wake word samples to start with seems to work best. It is important that your core data be sourced as much as possible from your target audience.  From there it's a matter of testing to see what is best to model.  Precise modeling runs quickly, even on a cpu, so don't be afraid to start over a few times and try things. 

##### wake words

I've recorded myself quite a bit for my wake word.  I vary speed, inflection, volume, distance from mic, tone, etc. I have gotten about a dozen other folks to record samples for my model as well.  Your target audience is where you should be sourcing most of your data from.  When recording, I've used a variety of mics.  I have a cheap small diaphragm condenser that hits a usb preamp, a cheap usb mic, and a few on a PS Eye.  This isn't necessary, it's just been a matter of what was handy.

I have only recently started recording with noisy backgrounds.  Will update if I get better info.
