Hey @FiftyOneFifty! Thanks for the question. As always with these types of questions, the answer is “a little of everything.”
As I mentioned in a previous post, there are two primary styles of speech recognition. The first (and the one most commonly offered to developers) uses a pre-defined grammar, and everything that can ever be recognized must fit into that grammar. The grammar itself limits the scope of what's recognizable, makes updating the vocabulary on the fly difficult, and its sheer size (imagine adding every music artist so you can request any style of music from Spotify) can be problematic even with the more relaxed space constraints of a desktop PC. There are commercial offerings in this space (most notably Nuance), but they are not free, and definitely not OSS.
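To make the "pre-defined grammar" idea concrete, here's a rough sketch using the PocketSphinx Python bindings with a JSGF grammar. Treat it as illustrative only: the file names (`commands.gram`, `command.raw`) and the specific phrases are made up, but it shows how the recognizer is boxed in by whatever you enumerate up front.

```python
# A minimal sketch of grammar-constrained recognition with PocketSphinx.
# File names and phrases are illustrative, not from any real project.
import os
from pocketsphinx import Decoder, get_model_path

# Everything the recognizer can ever return must be spelled out here --
# adding "play <any artist on Spotify>" would blow this grammar up fast.
GRAMMAR = """
#JSGF V1.0;
grammar commands;
public <command> = (turn | switch) (on | off) the (kitchen | bedroom) light
                 | set a timer for (one | five | ten) minutes;
"""

with open('commands.gram', 'w') as f:
    f.write(GRAMMAR)

model_path = get_model_path()                      # bundled en-us acoustic model
config = Decoder.default_config()
config.set_string('-hmm', os.path.join(model_path, 'en-us'))
config.set_string('-dict', os.path.join(model_path, 'cmudict-en-us.dict'))
config.set_string('-jsgf', 'commands.gram')        # constrain the search to the grammar
decoder = Decoder(config)

# Feed 16 kHz, 16-bit mono PCM audio; anything outside the grammar
# simply won't come back as a hypothesis.
decoder.start_utt()
with open('command.raw', 'rb') as audio:
    while True:
        buf = audio.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

if decoder.hyp() is not None:
    print('heard:', decoder.hyp().hypstr)
```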
The second style of speech recognition is called “dictation,” and as the name suggests it’s for transcribing general-purpose speech. This is the kind of thing you use every day on your smartphone (via Google Now or Siri). Nuance is (again) a competitor in this space, and their tech was rumored to back Siri, though I would guess Apple has since taken a lot of that in-house. There isn’t (to my knowledge) a high-quality open-source dictation recognizer available, and creating one would require significant specialized experience in the community. Even if we do find one, running it off-device would probably be a requirement as well.
For either of these scenarios, there are two large datasets that need to be built: an acoustic model and a language model. The acoustic model is a statistical catalog of the sounds in a language (or a language subset), and the language model describes how those sounds combine into words and likely word sequences. </endOverSimplifiedExplanation> The Wikipedia article on speech recognition is a good place to start learning about this stuff: https://en.wikipedia.org/wiki/Speech_recognition . Building both models requires a large amount of data and knowledge of a specialized set of tools. Most of those tools come out of academia and are not particularly well packaged, so just getting them running can be a struggle. There are some public-domain data sets available (like an English acoustic model of a male speaker reading the Wall Street Journal), but they’re few and far between.
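To make the "language model" half a bit more concrete, here's a toy sketch in plain Python (no speech toolkit involved): a bigram model built from a few sentences that scores how likely a word sequence is, which is roughly the job an LM does inside a recognizer. Real toolkits do this at vastly larger scale and with smoothing; this is just the shape of the idea.

```python
# Toy bigram language model -- a crude stand-in for what real LM toolkits
# produce from millions of sentences.
from collections import defaultdict

corpus = [
    "turn on the kitchen light",
    "turn off the kitchen light",
    "play some music in the kitchen",
]

# Count how often each word follows another.
bigrams = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    for prev, cur in zip(words, words[1:]):
        bigrams[prev][cur] += 1

def score(sentence):
    """Probability of a word sequence under the bigram counts (no smoothing)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        total = sum(bigrams[prev].values())
        if total == 0 or bigrams[prev][cur] == 0:
            return 0.0          # unseen transition -> impossible (a real LM smooths this)
        prob *= bigrams[prev][cur] / total
    return prob

# The recognizer uses scores like these to prefer a likely sequence over an
# acoustically similar but unlikely one.
print(score("turn on the kitchen light"))   # > 0
print(score("turn on the kitten light"))    # 0.0 (never seen)
```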
So, in summation: high-quality models can require a lot of resources (memory and CPU), the OSS tech for STT may or may not be of sufficient quality to make Mycroft a good experience (we’re still investigating OSS solutions), and everything comes back to data. Always with the data!
As @ryanleesipes has mentioned, we’ll be using PocketSphinx for local wake-word recognition (which is pretty limited in its capabilities), and then kicking off-device for dictation-style STT. Personally, I intend to make that latter part plug-and-play, so users can switch between Mycroft, GOOG, AMZN, or any other provider they’d like to use. The trade-off for you will be quality of recognition vs. privacy, as GOOG and the others have superior tech and resources, and likely will for the foreseeable future.
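None of this is final, but here's a sketch of what "plug-and-play" could look like. All class and method names below are hypothetical, not actual Mycroft code; the point is just that the core listener talks to an interface, and each provider is a drop-in implementation behind it.

```python
# Hypothetical sketch of a pluggable STT backend -- not actual Mycroft code.
from abc import ABC, abstractmethod


class STTBackend(ABC):
    """Anything that can turn captured audio into text."""

    @abstractmethod
    def transcribe(self, audio: bytes, language: str = "en-US") -> str:
        ...


class MycroftSTT(STTBackend):
    def transcribe(self, audio, language="en-US"):
        # Would send the audio to a Mycroft-hosted recognizer (endpoint made up).
        raise NotImplementedError


class GoogleSTT(STTBackend):
    def transcribe(self, audio, language="en-US"):
        # Would call Google's speech API instead; better accuracy, less privacy.
        raise NotImplementedError


# The listener only knows about the interface, so swapping providers
# becomes a one-line config change.
BACKENDS = {"mycroft": MycroftSTT, "google": GoogleSTT}

def get_stt(name: str) -> STTBackend:
    return BACKENDS[name]()
```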