Wake Word, Requests and Background Noise

I’m running the latest Picroft on a Google AIY Kit V1. It’s working pretty great so far and I love it. That said, it’s not perfect. I searched and couldn’t find this issue addressed, though there are some close subjects already.

One of the issues I’ve noted is that if I’m listening to podcasts or the TV is playing, it’ll catch my wake word just fine, but then it seems to have trouble parsing my actual request out of the background noise. I assume this is because “hey mycroft” has some kind of ML process behind it, making the software better at picking out the wake word than the user’s request. So I got to thinking about a solution. I hope you’ll forgive my ignorance on some of these subjects; maybe folks are already working on this, or perhaps it’s already happening but constraints keep it from being particularly effective.

I know training for a particular voice is a feature in most other assistants, which, unless I’ve missed it, Mycroft doesn’t currently offer. While that would serve as a solution, it creates a new problem: other people couldn’t interact with Mycroft as readily.

So what then? I’ve used software like Audacity that lets me take a sample of audio and use that clip as a filter to remove that noise from the rest of a recording. It seems like the opposite of this might be a solution for Mycroft. Now, I grant there may well be hardware, computational, or time limitations that would prevent this from working, but I don’t have the depth of knowledge on this subject to know. The wake word could be recorded and some algorithm used to create a sort of filter, along the lines of the Audacity filter but in reverse: keep only the voice that spoke the wake word and filter out everything else. Maybe this isn’t doable with the resources on a Pi?

Or perhaps some hybrid: sample the background noise just before the wake word, sample the wake word itself, and use the two together to filter out everything but the voice that made the request.
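To make that idea concrete, here’s a rough sketch (my own illustration, not anything that exists in Mycroft) of the simplest version: measure the energy of the audio captured just before the wake word and use it as a noise floor to gate the request audio. A real implementation would work in the frequency domain with overlapping windows, but the shape is the same.

```python
import math

def rms(frame):
    """Root-mean-square energy of a list of audio samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def noise_gate(samples, noise_samples, frame_len=160, margin=2.0):
    """Zero out frames whose energy is near the background noise floor.

    noise_samples: audio captured just before the wake word,
    assumed to contain only background noise.
    A frame is kept only if its energy exceeds margin * noise RMS.
    """
    floor = rms(noise_samples)
    out = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if rms(frame) > margin * floor:
            out.extend(frame)               # likely speech: keep it
        else:
            out.extend([0.0] * len(frame))  # noise-dominated: suppress it
    return out
```

This only suppresses quiet noise between words, of course; separating a voice from noise that overlaps it in time needs the spectral or model-based approaches discussed below.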

This wouldn’t even need to happen in real time every time. I’d imagine that after a few wake word samples from the same user you could build a fast, local model of that user’s voice, which could then quickly be separated from background noise. I assume this is what other assistants that can be trained do proactively during the training process.

Is this even feasible? It would be nice not to have to mute whatever content I’m listening to and still be able to reliably interact with Mycroft.

I know I might be asking for currently technically infeasible miracles so if I am I’m sorry. I do very much appreciate all the hard work by the Mycroft team and the community. Thank you!


There are two pieces in play here, as you’ve noticed: the wake word spotter and the Speech to Text (STT) engine. If you’re using “hey mycroft” as your wake word, hopefully you’re using Precise. It takes a bunch of clips of the wake word, along with clips of noise and not-wake-word, then builds a model to catch only the wake word. The “hey mycroft” wake word has tens of thousands of varied samples with and without noise. It’s a highly specific tool.

General STT has to, by its very nature, be more general, and getting thousands of samples with varied noise, accents, pitch, inflection, cadence, and speed is not very easy for open source projects. Google bought Grand Central (now Google Voice), and that jump-started their effort with tens of thousands of hours of voicemails*. Mozilla’s Common Voice English corpus, their largest, is finally over one thousand hours. LibriSpeech has another thousand. Together that’s maybe 1-10 percent of what the large players in the space have. This is one reason open source STT engines are generally not quite as good.

Take DeepSpeech. It’s a fairly competent STT engine, and by default it sucks at listening to me. Fine-tuning the DeepSpeech model is a possibility for what you’re talking about, and doing that certainly improves its recognition of my voice. Is it perfect? Nope. It’s still about half as accurate as Google’s STT. DeepSpeech can be run on local hardware**, and small fine-tuning jobs would be possible to do locally as well. But now you require additional pieces of software, which means more support headaches. What is possible is for someone to write a skill, or add core functionality, to save your utterances locally so you can do this on your own.

If you run it in the cloud, you now encounter personally identifying information issues that run contrary to Mycroft’s current state of being. That’s a concern for a large number of users as well, and though it’s been discussed before, it’s not happening soon, so I’ll skip it for now.

Having said all that, the default for Mycroft is to run through an anonymized connection to Google’s STT engine. If you haven’t changed STT choices, then you’re already using a top-of-the-line STT service, and it’s still having difficulty hearing you.

See here, where Joshua briefly talks about the scale of effort between the big tech companies and Mycroft.

ETA: another possibility that would make some sense is to implement a filtering system on audio capture. For example, running the clip through a denoise tool would probably help.
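One standard denoising technique that fits the "sample the noise, then filter" idea is spectral subtraction: take the magnitude spectrum of a noise-only clip, subtract it from each frame of the captured audio, and keep the original phase. A toy, pure-Python sketch follows (illustrative only; a real system would use FFTs, overlapping windows, and a smoothed noise estimate, not this O(n²) DFT):

```python
import cmath

def dft(frame):
    """Naive discrete Fourier transform of a real frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(spec):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def spectral_subtract(frame, noise_mag):
    """Subtract an estimated noise magnitude spectrum from one frame,
    keeping the original phase (classic spectral subtraction)."""
    cleaned = []
    for bin_val, nmag in zip(dft(frame), noise_mag):
        # Floor the magnitude at zero so subtraction never goes negative.
        mag = max(abs(bin_val) - nmag, 0.0)
        cleaned.append(cmath.rect(mag, cmath.phase(bin_val)))
    return idft(cleaned)
```

Usage would be: compute `noise_mag = [abs(x) for x in dft(noise_frame)]` from a clip recorded just before the wake word, then run each frame of the request through `spectral_subtract`. This kind of filtering is cheap enough for a Pi; the trade-off is the "musical noise" artifacts it can introduce.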

* Speculative; probably more, and it’s only grown since.
** I wouldn’t run it on anything less than a dual-core x86 from 2012 or newer; latency gets too high for my taste. Use a GPU if possible.

Hey Tekchip,

Great to hear the AIY Kit is working well.

There’s another aspect I’d add to baconator’s response: the Mark II will include a mic array. This provides a whole range of digital audio processing on the mic array hardware itself, giving Mycroft and the STT a much cleaner stream to process. The second half of this video shows just how powerful this is: