Grokotron: STT on the edge

Originally published at: Grokotron: STT on the edge - Mycroft

Mycroft AI’s primary mission has always been to create a truly privacy-respecting voice assistant: one that is a personal assistant rather than a household spying device, a device that does what you want it to do rather than what the mega-corporation that sold it to you wants it to do. So one of the greatest challenges for us has been the lack of a fast, accurate, flexible Speech to Text (STT) engine that can run locally. While it is still in the early days of development, we believe we finally have an answer to this problem. We call it Grokotron.

For a voice assistant like Mycroft, speech recognition must be performed very quickly and with a high degree of accuracy. Getting both right is one of the reasons that voice interfaces have exploded in recent years. When it comes to automatic speech recognition, the differences between 80%, 90%, 95% or even higher accuracy may sound like small potatoes, but they are absolutely game changing for how usable a system is in the real world.

We’ve tried a lot of local STT options over the years, and while there’s been incredible work going into many projects, unfortunately nothing has come close to providing the level of experience we think is required for a general purpose voice assistant.

For this reason, by default Mycroft has used Google’s STT cloud services and layered on some additional privacy protections. We proxy the requests through Mycroft’s servers and delete identifying data related to these requests as soon as possible. (You can read more about that here.) But as much as we try to mitigate the privacy exposure inherent in such a system, this has always been a stopgap solution – a necessary evil in order to provide a quality voice experience.

We want Grokotron, our new STT module (based on the great work done on the Kaldi project), to break this reliance. It is not yet ready to replace big data cloud services for all users and all use cases, but we have big plans for it and look forward to it becoming a viable replacement for cloud services for those who want a zero-trust privacy solution.

Grokotron provides limited-domain automatic speech recognition on low-resource hardware like the Raspberry Pi 4 that comes in the Mark II. It does this extremely quickly, and of course completely offline. Grokotron’s impressive accuracy and performance are due to its hybrid nature: it includes both an acoustic model and a grammar of expected expressions which constrains its transcription. This grammar is easy to define and extend with a simple markup language. Because it can be expanded so easily, the range of expressions Grokotron can process, while limited, can be quite large and can practically be extended to cover nearly anything a voice assistant needs.
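As a purely illustrative sketch of what such a grammar could look like (the section names, slot markers, and syntax here are assumptions made for illustration; the real template format is described in the Grokotron documentation), a handful of expected expressions might be declared like this:

```
# Hypothetical sentence templates; see the Grokotron documentation for the real syntax.
# (word | alternatives) in parentheses, <slots> in angle brackets.
[weather]
(what is | what's) the weather (today | tomorrow)
will it rain (today | tomorrow)

[timer]
set a timer for <number> (minutes | hours)

[music]
play music by <artist>
```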

So whilst it won’t yet be transcribing your original space opera screenplay about an invasion led by the first Pontifex Dvorn… it can understand all of your requests to check the weather, set a timer, and even play different music.

To show this in action, we wanted to share a complete proof of concept image. This is a Mark II Sandbox image running the new Dinkum software with a couple of tweaks.

  1. It has Grokotron pre-configured for STT.
  2. It does not need to be connected to the internet to function.
  3. It has our backend pairing completely disabled, so even if you do connect to the internet, it won’t touch our servers.

Because this image is designed to run completely offline, functions normally provided by our backend are not available, including paid APIs like the weather and Wolfram Alpha. Some settings normally configured on the backend, such as the device’s location, must also be set manually within your mycroft.conf. See the Grokotron documentation for details.
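As a rough sketch, setting a location in mycroft.conf looks something like the following; the values shown are placeholders, and the exact keys to use are covered in the documentation:

```json
{
  "location": {
    "city": { "name": "Lawrence" },
    "coordinate": { "latitude": 38.971669, "longitude": -95.23525 },
    "timezone": { "code": "America/Chicago", "name": "Central Standard Time" }
  }
}
```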

The grammar pre-configured on this image does not yet cover all of the expressions that Mycroft’s core intent system can understand; however, it is straightforward to update the grammar and retrain the model on-device. Details on the sentence template syntax and training commands can also be found in the Grokotron documentation.

A Mycroft system already knows the majority of the utterances it expects to hear: these strings form the basis of both intent matching and integration test cases. A future optimization would be to reduce duplication of these definitions and have Grokotron use them to provide a local-only grammar model for any Skill that gets installed.
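As a rough sketch of that idea (the Skill directory layout, the .intent file convention, and the output path here are assumptions, not Grokotron’s actual tooling), a build step could walk each installed Skill’s intent files and concatenate their sentence templates into a single grammar source for retraining:

```python
#!/usr/bin/env python3
"""Collect sentence templates from installed Skills into one grammar source.

Hypothetical sketch: the skill directory layout, the .intent file convention,
and the output path are assumptions, not part of Grokotron's actual tooling.
"""
from pathlib import Path

SKILLS_DIR = Path("/opt/mycroft/skills")             # assumed Skill install location
GRAMMAR_OUT = Path("/opt/grokotron/sentences.txt")   # assumed grammar source file


def collect_templates(skills_dir: Path) -> set:
    """Gather every template line from each Skill's locale/en-us/*.intent files."""
    templates = set()
    for intent_file in skills_dir.glob("*/locale/en-us/*.intent"):
        for line in intent_file.read_text().splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                templates.add(line)
    return templates


if __name__ == "__main__":
    templates = collect_templates(SKILLS_DIR)
    GRAMMAR_OUT.parent.mkdir(parents=True, exist_ok=True)
    GRAMMAR_OUT.write_text("\n".join(sorted(templates)) + "\n")
    print(f"Wrote {len(templates)} unique templates to {GRAMMAR_OUT}")
```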

Even big cloud STT systems have trouble with proper nouns, and media libraries are a classic challenge here. Beyoncé is only recognized because of how popular she is, but how about Ke$ha, or Urthboy? These names are trained into cloud-based models courtesy of partnerships with streaming media providers, but for open source tools such terms have traditionally been a bridge too far. Grokotron can use entity lists to define exactly the names it needs to recognize for each individual user, which goes a long way toward mitigating this problem. Even better, such lists can be compiled and the model efficiently retrained on the fly. For instance, on ingestion of a music library, artist names could automatically be compiled into Grokotron’s grammar. This is just one feature we plan to work on to make Grokotron the best local STT system out there.
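As an illustration of how that could work (again a sketch: the entity-list path and format are assumptions, and mutagen is simply one convenient way to read audio tags), ingesting a music library might mean reading artist tags and writing them out as an entity list before retraining:

```python
#!/usr/bin/env python3
"""Build an 'artist' entity list from a local music library.

Sketch only: the entity-list location and format are assumptions, not part of
Grokotron's actual tooling; mutagen is just one way to read audio tags.
"""
from pathlib import Path

from mutagen import File as AudioFile  # pip install mutagen

MUSIC_DIR = Path("/home/mycroft/Music")                  # assumed library location
ENTITY_OUT = Path("/opt/grokotron/entities/artist.txt")  # assumed entity list path

AUDIO_EXTENSIONS = {".mp3", ".flac", ".ogg", ".m4a"}


def collect_artists(music_dir: Path) -> set:
    """Read the artist tag from every audio file under music_dir."""
    artists = set()
    for path in music_dir.rglob("*"):
        if path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        audio = AudioFile(str(path), easy=True)
        if audio is None or not audio.tags:
            continue
        artists.update(name.strip() for name in audio.tags.get("artist", []))
    return artists


if __name__ == "__main__":
    artists = collect_artists(MUSIC_DIR)
    ENTITY_OUT.parent.mkdir(parents=True, exist_ok=True)
    ENTITY_OUT.write_text("\n".join(sorted(artists)) + "\n")
    print(f"Wrote {len(artists)} artist names to {ENTITY_OUT}")
```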

Without further ado, you can find the first Grokotron image here:

Download Grokotron

Grokotron Documentation

11 Likes

This is a great post-holiday announcement. One of the reasons I committed to picking up a Mark II is because I saw commitments to running many of these services offline. Very cool to see that taking shape a few weeks after receiving the Mark II.

7 Likes

Yes! Yes! Very excited to see this.

But to support the open ecosystem, it would be great to see instructions for building Grokotron and the mycroft-dinkum image from git repos rather than relying on prebuilt images!

3 Likes

Agreed! I would really love to see more base level documentation.

1 Like

Yes, not to mention simply more base-level communication about how Mycroft.AI is going to face up to its current crisis of confidence among the community, brought on by the company’s sheer lack of communication regarding the shocking introduction of the beta-quality Dinkum alongside the long-delayed arrival of the Mark II. How are they planning to bring Dinkum up to par with where their marketing implies the Mark II is at? How are they going to work with and incorporate the community in true FLOSS fashion? Whiz-bang stuff for the potential future can’t entirely make up for present deficits. These issues must be faced and addressed in the here and now. If Mycroft.AI doesn’t pull itself together it may not make it into that shiny future potential – how tragic that would be, and how preventable by the kind of simple communication that has not been happening for months now.

7 Likes

Pretty much my sentiments. Whoever is making the decisions to remain silent (and yes, I do believe it is a conscious decision by someone in leadership) should be released from employment or overridden by the other employees. This is one of the worst rollouts I have ever seen, and I’ve taken part in more Kickstarters than I can count.

2 Likes