Future of Mycroft STT

modernonline · February 27, 2022, 11:54am

Hi,

I wanted to check in on where Mycroft’s STT roadmap currently stands. Use of Google, whilst understandable, is only temporary, I guess? Is Mozilla getting anywhere close to something useful? Will the new Mycroft enclosure processor be able to run STT locally?

I’ve seen that the upcoming Alexa will process transcription locally and that the Pixel has sufficient hardware to run the Google Assistant, so isn’t Mycroft relying on Google cloud becoming increasingly counter-intuitive privacy-wise?

JarbasAl · February 27, 2022, 12:24pm

You can already run STT offline locally; there are a few plugins.

The Home Assistant demo says it runs fully offline. I believe Mycroft is experimenting with Coqui on device, but I didn’t find any repo for it.

Dominik · February 28, 2022, 7:29am

Mozilla discontinued development of DeepSpeech. The former Mozilla dev team founded coqui.ai and continues development of STT (and TTS). Their latest pretrained English STT model has a WER of 4.5% (on the LibriSpeech clean test set), which I would consider “usable”. If you have a CUDA GPU you may want to look at the Nvidia models, e.g. Conformer-Transducer, which has an even better WER of 1.7%.
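
For anyone unfamiliar with the metric: WER (word error rate) is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. Below is a minimal sketch, assuming plain whitespace tokenization (published benchmarks usually normalize case and punctuation first). The function name is only illustrative and not taken from any STT library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words to match an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("turn on the light", "turn off the light"))
```

On that scale, a WER of 4.5% means roughly one wrong word in every 22 reference words.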

modernonline · February 28, 2022, 8:13am

Ah nice! Also wasn’t aware of the Nvidia ones; should run pretty nicely on a Jetson Nano.

Dominik · February 28, 2022, 9:29am

In my tests on a Xavier-AGX, the Conformer-Transducer model for German has a real-time factor (RTF) of 0.127 (one second of audio input was processed in 0.127 seconds). I don’t remember how much of the 32GB RAM was used.
As the Xavier-AGX GPU is much faster than the Jetson Nano (32 TOPS vs 0.5 TOPS), you might end up with an RTF of >>1 on the Nano, i.e. slower than real time, which might be inconvenient for daily usage…
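
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes that RTF scales inversely with the quoted TOPS figures, which is a crude assumption: memory bandwidth, numeric precision and model size all matter, so treat the result as an order-of-magnitude guess rather than a measurement. Both function names are made up for illustration.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio (below 1 is faster than real time)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def naive_scaled_rtf(measured_rtf: float, measured_tops: float, target_tops: float) -> float:
    """Estimate RTF on other hardware, ASSUMING throughput is proportional to TOPS."""
    if target_tops <= 0:
        raise ValueError("target_tops must be positive")
    return measured_rtf * (measured_tops / target_tops)


if __name__ == "__main__":
    xavier_rtf = real_time_factor(processing_seconds=0.127, audio_seconds=1.0)
    # 32 TOPS (Xavier-AGX) vs 0.5 TOPS (Jetson Nano), the figures quoted above.
    nano_estimate = naive_scaled_rtf(xavier_rtf, measured_tops=32.0, target_tops=0.5)
    print(f"Xavier-AGX RTF: {xavier_rtf:.3f}, naive Nano estimate: {nano_estimate:.2f}")
```

Under that assumption the Nano estimate comes out at roughly 8, i.e. about eight seconds of processing per second of speech.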

baconator · February 28, 2022, 6:46pm

There’s a small model now. I haven’t compared it with the large one, but it may work better on devices with less memory.

modernonline · March 1, 2022, 8:05am

Yup that’s quite a difference. I don’t have a Nano handy atm but curious to try.

puchatek · April 12, 2022, 10:06am

Newer versions of DeepSpeech feature a streaming API. You can run a server locally that makes use of that API (JPEWdev/deep-dregs on GitHub, a streaming speech-to-text server using DeepSpeech) and make Mycroft stream the captured sound by configuring the STT engine accordingly:

  "stt": {
    "module": "deepspeech_stream_server",
    "deepspeech_stream_server": {
      "stream_uri": "http://localhost:8080/stt?format=16K_PCM16"
    }
  },

Took a bit of fiddling to get the environment set up on Raspbian Buster, but ultimately pipenv helped me out.

With that in place you get quite a speed-up in the “realtimeness”, even if it all runs on a Raspberry Pi 4 with 4GB of RAM. Probably easier than getting a Jetson Nano setup working.
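
If you would rather script the config change than edit the file by hand, something like the sketch below could merge the "stt" block above into an existing config. It assumes your user config is plain JSON without comments, and that you pass in the path yourself, since the location differs between installs. The helper names are hypothetical and not part of Mycroft.

```python
import json
from pathlib import Path

# The STT settings from the snippet above, expressed as a Python dict.
STT_OVERRIDE = {
    "stt": {
        "module": "deepspeech_stream_server",
        "deepspeech_stream_server": {
            "stream_uri": "http://localhost:8080/stt?format=16K_PCM16"
        },
    }
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied, merging nested dicts recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def apply_stt_override(config_path: Path) -> dict:
    """Merge STT_OVERRIDE into the JSON file at config_path (created if it is missing)."""
    existing = json.loads(config_path.read_text()) if config_path.exists() else {}
    updated = deep_merge(existing, STT_OVERRIDE)
    config_path.write_text(json.dumps(updated, indent=2) + "\n")
    return updated
```

Call it as apply_stt_override(Path("/path/to/your/mycroft.conf")) with your own path, then restart Mycroft so the new STT module is picked up. Other keys in the file are left untouched.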

Dominik · April 13, 2022, 7:58am

DeepSpeech is dead, long live coqui.ai STT, which has a streaming API as well; don’t know if the streaming server mentioned above is still working with it…

StuartIanNaylor · April 13, 2022, 11:12am

Yeah, the DeepSpeech guys jumped to create Coqui when the funding dried up, so it’s really DeepSpeech in disguise with some newer bits and bobs.

I haven’t tried ESPnet, so I’m clueless about the load and what is needed to run it, but currently I think it’s the one to beat, as Kaldi has lost some ground.

is quite interesting as they make no boasts and call it “Almost State-of-the-art Automatic Speech Recognition in Tensorflow 2”, but the tflite version is really lite and speedy and I need to give it a revisit.

DeepSpeech/Coqui, last time I looked, sort of ended up in no man’s land where it was neither state of the art nor that lite, just managing real time on a Pi 4, but I haven’t looked at whether they have managed to optimise since.

puchatek · April 14, 2022, 7:11am

don’t know if the streaming server mentioned above is still working with it

Me neither. There was an issue on another DeepSpeech server on GitHub, posted by a Coqui contributor, urging the implementer to move to Coqui, so it sounds like at least a little fiddling with the code is needed. Maybe just a change in dependency + imports, but I can’t say for sure.

Plus, AFAIK the first Coqui model was basically just a rebranded version of the last DeepSpeech model. They have a more recent one which is supposed to perform better, but there’s no “tflite” version of it (yet?). So for a Raspberry Pi-only solution with streaming definitely working, DeepSpeech 0.9.3 is still a viable option.