Guest Blog - Hey Mycroft, how do you work? - STT Systems for Voice

Originally published at: Guest Blog – Hey Mycroft, how do you work? – STT Systems for Voice -

“Hey Mycroft, how do you work?”

Sounds like a simple question for a voice assistant to answer, right? Well, yes and no. To paraphrase a potential source, Wikipedia’s “Virtual Assistant” article:

Voice assistants use natural language processing (NLP) to match user voice input to executable commands. Many continually learn using artificial intelligence techniques including machine learning. To activate, a wake word might be used. This is a word or groups of words such as "Alexa", "Hey Mycroft", or "OK Google".

That isn’t a bad response, but it leaves out important points and a lot of detail.

For example, NLP, as Wikipedia classifies it, contains multiple essential modules of the voice stack, one of the most important being Speech Recognition. Without a speech-to-text (STT) engine, an NLP intent parser couldn’t connect a request to an action. Every voice assistant needs robust STT.

What is Speech Recognition?

“Speech Recognition” is the ability of someone (or something) to perceive and understand speech from another source.

Humans are born with the capacity for intelligence, but they still have to learn. A newborn baby can hear sounds, but its brain cannot yet interpret them. Machines are much the same: they won’t become intelligent until you spend time teaching them, and they can’t comprehend sound until they’re trained. That brings us to our next section.

How does Speech Recognition work?

For a machine to begin comprehending sounds, it needs an input. All voice assistants (including Siri, Cortana, Alexa, and Google Assistant) get their prompts from a voice input device: a Bluetooth headset, a built-in microphone, or anything of that sort. But the sound alone does nothing.

One of the “classic” ways to develop Speech Recognition is detecting phonemes with Hidden Markov Models (HMMs). In the early stages of development, Speech Recognition software has to learn the structure and fundamental rules of a language.

Hidden Markov Models

HMMs are widely used in this area because of their design. An HMM works on the principle of choosing the most probable next outcome given the one that came before it. Depending on which words come before and after each other in a sentence, an HMM outputs its best guess based on its training data and context.
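As a rough illustration of that "most probable outcome given the previous one" principle, here is a minimal sketch of Viterbi decoding, the standard way to find the most likely hidden-state sequence in an HMM. Everything in it is an assumption made for illustration: the two phoneme states, the "breathy" and "voiced" frame labels, and every probability are invented, not taken from any real STT engine.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            # Pick the predecessor that makes arriving in state s most probable.
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Hypothetical toy model: two phonemes and two kinds of audio frame.
STATES = ["HH", "EY"]
START_P = {"HH": 0.8, "EY": 0.2}
TRANS_P = {"HH": {"HH": 0.4, "EY": 0.6}, "EY": {"HH": 0.1, "EY": 0.9}}
EMIT_P = {
    "HH": {"breathy": 0.9, "voiced": 0.1},
    "EY": {"breathy": 0.2, "voiced": 0.8},
}
FRAMES = ["breathy", "breathy", "voiced", "voiced"]

decoded = viterbi(FRAMES, STATES, START_P, TRANS_P, EMIT_P)
print(decoded)  # ['HH', 'HH', 'EY', 'EY']
```

Real systems work with many more states and with probabilities learned from recorded speech, but the idea of scoring each step against the previous one is the same.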

When you say “Hey”, the voice assistant hears the phonemes “Hh-ey”. Then it searches its database to see what word(s) match those phonemes. If it finds a match for the input, it can produce the appropriate output, thus answering your question.

But what if you wanted to say “hay” instead of “hey”? This is where context comes into play. Go on and unlock your smartphone, launch the voice assistant and say something. You’ll see that it takes a short but noticeable moment to produce the output. And sometimes it may even revise its guess for a word as you say more of the sentence. This process depends on what is known as the language model.

When people talk and someone says something irrelevant, we’d say that what they said was “out of context”, and that’s exactly what the software tries to avoid. By processing the whole sentence you provide, the software can transcribe speech better by checking whether the possible outputs match sentences from its training data that include the same keywords. So, by seeing whether you’re talking about horses or calling out to a human, HMM Speech Recognition can more accurately guess if you said “hey” or “hay.”
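To make the “hey” versus “hay” decision concrete, here is a small sketch that scores each candidate against its neighboring words. It uses a table of word-pair (bigram) counts as a simple stand-in for a language model; it is not an HMM, and the counts are made up for illustration rather than learned from real text.

```python
# Made-up bigram counts standing in for a language model learned from text.
BIGRAM_COUNTS = {
    ("hey", "mycroft"): 50,
    ("hay", "mycroft"): 0,
    ("the", "hay"): 30,
    ("the", "hey"): 1,
    ("hay", "bale"): 20,
    ("hey", "bale"): 0,
}


def score(words):
    """Product of add-one-smoothed bigram counts; higher means more plausible."""
    total = 1
    for pair in zip(words, words[1:]):
        total *= BIGRAM_COUNTS.get(pair, 0) + 1
    return total


def pick_homophone(before, candidates, after):
    """Choose the candidate word that best fits the surrounding words."""
    return max(candidates, key=lambda w: score(before + [w] + after))


print(pick_homophone([], ["hey", "hay"], ["mycroft"]))    # hey
print(pick_homophone(["the"], ["hey", "hay"], ["bale"]))  # hay
```

Both candidates sound identical, so only the words around them decide the outcome.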

Limitations and Availability

Some languages are more difficult to comprehend than others, and may contain more phonemes and different syntactical and grammatical rules. Each language requires a specifically trained HMM to work properly.

These days, developers can rely on pre-made application programming interfaces (also known as APIs) called from their software’s source code, which give the program access to a model that already knows about phonemes and language rules.

There are various STT APIs, but some of the best ones are Google’s Speech API, Bing Speech API, and Speechmatics. Long-standing Open Source options are CMU Sphinx and Kaldi. Many developers do not have the knowledge and resources required to develop their own STT engine. Using one of the pre-existing engines allows them to easily add an STT feature to their programs.

The Role of Machine Learning

But the classics are meant to be improved upon. HMMs have given way to Machine Learning based models. Mozilla’s DeepSpeech is one STT application using Machine Learning to transcribe human speech. This has allowed a steadily growing base of developers to create new applications with Speech Recognition features.

A common example used to describe machine learning is teaching a machine to tell cats and dogs apart. If you just wait around for it to magically happen, it won’t. You have to apply some supervised learning (a widely used term in the AI field).

[Image: An artist’s rendering of the Machine Learning process, by Selman Design.]

To do that, you must gather some inputs; in this case, photos of cats or dogs. Then you tag what each picture shows and feed that collection to the software. This tagged dataset is fed into an ANN (Artificial Neural Network), a computer model loosely inspired by the way our brains work. In this example, the network is rewarded for correctly identifying cats in pictures it’s never seen before. Rewarding correct identifications and adjusting for misses millions of times makes the model very good at its task. After millions or billions of training cycles, it should be able to identify whether the object in question is visible in a brand new photo unknown to it.
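The tag, feed, and adjust loop can be sketched with a single artificial neuron (a perceptron), which is far simpler than the networks used in practice. As an assumption for illustration, each “photo” is reduced to two hand-made numbers (ear pointiness and snout length, both between 0 and 1) instead of pixels, and the six tagged examples are invented.

```python
def predict(weights, bias, features):
    """Return 1 (cat) if the weighted sum is positive, otherwise 0 (dog)."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


def train(examples, epochs=10, learning_rate=0.1):
    """Nudge the weights toward the correct label every time the model misses."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)  # 0 when correct
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical tagged dataset: (ear_pointiness, snout_length) -> 1 = cat, 0 = dog.
TAGGED = [
    ((0.9, 0.2), 1), ((0.2, 0.8), 0),
    ((0.8, 0.3), 1), ((0.3, 0.9), 0),
    ((0.7, 0.1), 1), ((0.1, 0.7), 0),
]

weights, bias = train(TAGGED)
names = {1: "cat", 0: "dog"}
# Two examples the model has never seen before.
print(names[predict(weights, bias, (0.85, 0.25))])  # cat
print(names[predict(weights, bias, (0.15, 0.85))])  # dog
```

A real image classifier learns its own features from raw pixels and has millions of adjustable weights, but each training step follows the same pattern: guess, compare with the tag, adjust.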

Computers can “see” sound?

How does visual pattern matching relate to STT? Actually, all this talk of computers ‘hearing’ is, in reality, a visual process. To transcribe speech, computers visualize the sound in a spectrogram. A spectrogram shows which frequencies are present in the sound, how much energy each one carries, and how that changes over time.
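For a sense of what goes into a spectrogram, here is a minimal sketch that slices a signal into overlapping frames and measures the energy at each frequency with a plain discrete Fourier transform. The 8000 Hz sample rate, the frame size, and the two test tones are arbitrary choices for illustration; production code would use an optimized FFT library rather than this slow pure-Python loop.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (arbitrary choice for this sketch)
FRAME_SIZE = 64     # samples per frame, so each frequency bin is 125 Hz wide


def spectrogram(samples, frame_size=FRAME_SIZE, hop=32):
    """Return one list of frequency-bin magnitudes per overlapping frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        magnitudes = []
        for k in range(frame_size // 2 + 1):
            # Discrete Fourier transform of this frame at frequency bin k.
            total = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            magnitudes.append(abs(total))
        frames.append(magnitudes)
    return frames


def tone(freq_hz, n_samples):
    """Generate a pure sine wave at the given frequency."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE) for n in range(n_samples)]


# A low tone followed by a higher one, like two different sounds in a word.
signal = tone(500, 256) + tone(1500, 256)
spec = spectrogram(signal)

bin_hz = SAMPLE_RATE / FRAME_SIZE
loudest = [max(range(len(f)), key=f.__getitem__) * bin_hz for f in spec]
print(loudest[0], loudest[-1])  # 500.0 1500.0
```

Plotting `spec` as an image, with time on one axis and frequency on the other, gives the picture that the speech model is trained on.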


The same image matching that can be done with cats and dogs can be done with speech spectrograms. Instead of tagged pictures, these networks are fed hundreds of hours’ worth of spectrograms with the corresponding transcriptions. The network is trained, tested on new recordings, and rewarded for each correct transcription.

These are currently called “end-to-end” STT models. End-to-end models have shown success in transcribing speech in multiple languages with one model, when trained on large enough datasets including those languages. This is the cutting edge of Speech Recognition and the basis of Mozilla DeepSpeech.

Open Datasets

This is why collecting data is so important to technology companies. The more voicemail transcripts, assistant requests, and other recordings a company can capture, the better it can build its STT. That is why most companies keep their data to themselves. Mycroft, however, mostly depends on Open Datasets, where everyone has access to the data. Anyone is able to contribute, edit, or even take a portion of the data and use it for his or her own project.

What most people fail to understand, however, is the amount of data that is required for speech to be recognized. You’ve seen cases where Siri, for example, has booked a date at the wrong time or booked the wrong hotel room. Mispronunciations or even a slight accent can lead the software to unintended actions. But having an Open Dataset can help with that. Diversity isn’t just another feature. It’s a mandatory requirement for success.

But machine learning relies on well-tagged datasets. Remember Tay? The AI chatbot made by Microsoft was released on Twitter without any supervision, and within 24 hours it was spewing racist and crude remarks because of the data Microsoft allowed it to ingest.

Spoil the system and the boomerang will turn back your way and strike you. Training datasets must be clean and well-tagged to be effective.

What if you could improve the responses yourself?

This is where Mycroft differs from the other players in the field. Being open source, everyone with access to the Internet can help make Mycroft better. The only thing standing between you and a better voice assistant is a thin wall you can easily demolish by creating an account. In order to do so, start at

Once there, you can contribute to Mycroft’s Open Dataset just by Opting-In and using Mycroft. To go a step further, people can help make Mycroft better by tagging data - listening to various sentences and judging whether the software “heard” right or wrong. This helps in creating a diversity of clean inputs. Thus, with a lot of “practice”, Mycroft could successfully understand native English speakers from Britain or America as well as speakers with Asian, African, or European accents.

That’s the reason Mycroft is doing so well at the moment and will continue to do even better than most other companies.

How You Can Help Mycroft Improve

There’s always space for improvement, and in this case, you can directly aid in making Mycroft better.

Creating an account is very easy: just go here. Don’t worry, you don’t necessarily have to own a Mycroft device. You could always download and install Mycroft on your Linux computer or Raspberry Pi.

Then you will be able to access what I like to call a couple of mini-games. I say that because they are actually fairly entertaining. These are the Precise tagger and the DeepSpeech tagger.

In the Precise tagger, you’ll be helping Mycroft learn how to better identify whether a spoken phrase is “Hey Mycroft” or not. Pretty simple, right?

For the DeepSpeech tagger, you will get to hear a word or phrase and at the same time, you’ll be given a string of text. Your goal is to judge whether what you heard exactly matches the provided text. This doesn’t only help Mycroft, but it actually helps in building a dataset which will be shared with Mozilla for training the DeepSpeech STT engine.

Both mini-games will award you with points for every task so you can level up and get to the leaderboard. So not only are you helping a great cause but you can also get the satisfaction of being an important contributor.

The Open Future of Voice Starts with You

All this wouldn’t be possible if Mycroft’s platform weren’t open source, meaning that everyone can help make it better. Apart from what we already mentioned, you could also write your own code or, if you are not a developer, simply download Picroft, Opt-In to the Open Dataset, and use Mycroft. You can find everything you need here!

So now that you know how the whole thing works, it is time to take things into your own hands. You can support Mycroft by investing to become a community partner. It doesn’t matter whether you are a developer or just another technology aficionado. Your actions count the same, and together we can all help make the world better. Even if you are not a developer, you could use a Mark I or Mark II.

The whole point of Mycroft is that technology should be accessible to everyone and, in turn, everyone should aid in creating an open and safe future!