Use Nvidia Riva as TTS/STT

I have an Nvidia AGX Xavier, and recently the Riva speech services Docker container was made available. I have been testing it for a while, and it is by far the most accurate and responsive speech recognition service I have tried. What would be required for me to adapt the code to make use of it? I feel fairly certain that I could make a crossover script to accept whatever API calls Mycroft makes to something like Mozilla TTS, but is there a more direct way? Thanks!

I own a Xavier AGX too and had a similar idea. Unfortunately Nvidia likes to lock you into their ecosystem. I understand Riva has some kind of API that could be used to build an "adapter" for Mycroft calls, but I found it overly complex, let alone setting up a Triton server to run it.

I tried to run a Nemo-ASR STT model "standalone" as a mini-server, but it didn't work out, and the developer team wasn't too helpful when "3rd party developers" asked for new features.

Now I use Mimic3 for TTS (or Coqui-TTS for even better quality). Still searching for a good STT solution; maybe I'll give Nemo-ASR another try…
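For what it's worth, mimic3-server exposes a small HTTP API, so it already fits the "call it from anywhere" pattern. Here's a minimal sketch of building a request URL for it; the default port (59125), the `/api/tts` endpoint, and the voice name are assumptions taken from a default mimic3-server install, so verify them against your setup:

```python
from urllib.parse import urlencode

def mimic3_url(text, host="http://localhost:59125", voice="en_US/vctk_low"):
    """Build a GET URL for a running mimic3-server instance.

    Port, endpoint, and voice key are assumed defaults; adjust for your install.
    """
    return f"{host}/api/tts?{urlencode({'text': text, 'voice': voice})}"

# With a server running, the response body is raw wav bytes:
# from urllib.request import urlopen
# wav_bytes = urlopen(mimic3_url("Hello from Mimic 3")).read()
print(mimic3_url("hello world"))
```

The fetch itself is commented out since it needs a live server; anything that can issue an HTTP GET (a Mycroft TTS plugin, a shell script, a satellite device) can consume it.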

Yeah, I have been speaking to someone on the Rhasspy site who also seems to be finding Nvidia a little unfriendly. When it's running it's supposedly really good, even though a Xavier AGX is at the far extreme of the Raspberry Pi price range.

I am not sure why Mycroft supplies specific modules rather than just a framework that lets you "wire in" any of them, since an STT module's input is audio and its output is text; that is all it does.
This refactoring and rebranding of permissively licensed code has always confused me, and it would almost seem a waste of precious dev time compared to creating a framework that allows any.

Mycroft has a plugin system to allow integration with any 3rd-party STT/TTS.

We also have Mycroft-compatible plugins for OpenVoiceOS, with the advantage that they can be used standalone in any other project.

I never heard of Riva before, but I don't see why it wouldn't be compatible.

It's the new Nvidia framework that replaces Nemo.

I did not bother to respond, as I'm starting to worry that everything I say about Mycroft is negative.
The plugins assume an all-in-one design where the plugged-in item lives in the same host Python code, and I am not a fan.
I feel that each element should be standalone and connected by a network layer, so that we are not later trying to create a distributed model from a base designed to be singular.
If you have multiple hosts, servers, or containers, you should be able to link modules without embedding them in Python code.
Mycroft has always had this all-in-one focus rather than a distributed infrastructure, which is a shame, as it doesn't naturally scale up, whereas you can scale a distributed infrastructure of multiple containers down onto one host.
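The network-layer split described here can be sketched with nothing but the standard library: any STT engine sits behind a tiny HTTP service, and clients anywhere on the network POST audio bytes to it. The `transcribe` function below is a hypothetical stand-in for a real engine (Riva, Nemo-ASR, whatever), so the sketch runs without any of them installed:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

def transcribe(audio_bytes: bytes) -> str:
    # Hypothetical stand-in: swap in any real engine here.
    return f"received {len(audio_bytes)} bytes"

class STTHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the posted audio and return the transcript as JSON.
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)
        body = json.dumps({"text": transcribe(audio)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = HTTPServer(("127.0.0.1", 0), STTHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Any host, container, or language on the network can now call it:
url = f"http://127.0.0.1:{server.server_port}/stt"
reply = urlopen(Request(url, data=b"\x00" * 16)).read()
print(json.loads(reply)["text"])  # -> received 16 bytes
server.shutdown()
```

Because the interface is just HTTP, you can run the engine in its own container (or on the Xavier) and point any number of assistant frontends at it, which is the "scale down a distributed infrastructure onto one host" case.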

I just copied a chunk out of the example transcription script, and as far as I can tell I have two blocks of code: audio in / returns a string of what was said, and string in / text-to-speech reads it. However, I have no idea where to put this. I tried editing the source STT file, using the DeepSpeech server module as a starting point, and I have it successfully making a call to the inference API, but there appears to be more that I am missing.

The code from the Nvidia Riva docs is basically this for speech to text (content is the raw bytes of a wav file):

import riva.client

auth = riva.client.Auth(uri='localhost:50051')

riva_asr = riva.client.ASRService(auth)

# Set up an offline/batch recognition request
config = riva.client.RecognitionConfig()
#config.encoding = riva.client.AudioEncoding.LINEAR_PCM  # Audio encoding can be detected from wav
#config.sample_rate_hertz = 0                      # Sample rate can be detected from wav and resampled if needed
config.language_code = "en-US"                    # Language code of the audio clip
config.max_alternatives = 1                       # How many top-N hypotheses to return
config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
config.audio_channel_count = 1                    # Mono channel

# Read the raw wav bytes to transcribe
with open("audio.wav", "rb") as fh:
    content = fh.read()

response = riva_asr.offline_recognize(content, config)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript:", asr_best_transcript)

print("\n\nFull Response Message:")
print(response)

Seems fairly simple to implement; check the Chromium plugin for an example.

Basically, init the riva_asr object in the __init__ method, reading any relevant values from self.config, and return the transcript in execute.

self.config comes from mycroft.conf and can be used for any values the end user may want to modify, such as the host URL in your case.
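Putting those two points together, a Riva STT plugin could look roughly like this. It is a sketch, not tested against a live Riva server: in a real plugin the class would subclass Mycroft's STT base class, the riva.client calls mirror the snippet above, and execute's audio argument is assumed to expose get_wav_data() the way Mycroft passes recorded audio:

```python
class RivaSTT:  # in a real plugin: subclass mycroft.stt.STT
    def __init__(self, config=None):
        # In a real plugin, self.config is populated from mycroft.conf.
        self.config = config or {}
        self.uri = self.config.get("uri", "localhost:50051")
        self.lang = self.config.get("lang", "en-US")

    def execute(self, audio, language=None):
        # Lazy import so the class loads even without Riva installed.
        import riva.client
        auth = riva.client.Auth(uri=self.uri)
        asr = riva.client.ASRService(auth)
        config = riva.client.RecognitionConfig()
        config.language_code = language or self.lang
        config.max_alternatives = 1
        config.enable_automatic_punctuation = True
        response = asr.offline_recognize(audio.get_wav_data(), config)
        return response.results[0].alternatives[0].transcript

stt = RivaSTT({"uri": "my-xavier.local:50051"})
print(stt.uri)  # -> my-xavier.local:50051
```

The "uri" key is a made-up config name for illustration; pick whatever key you expose in mycroft.conf.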

The new AGX Nano presents a really nice way of using some dedicated hardware to perform both TTS and really, really good STT.

So that brings up a question: what's the best, most human-like voice model available? Is it something on Riva? If so, you could do away with any of the Mycroft Mimic models and just go with the TTS on the Riva platform. If it's one of the Mimic 3 voices, is there any way to offload training of voice models like Mimic 3 to specialized hardware like the AGX Nano?

To be honest, with Riva in place, you could do away with Mycroft altogether; all you would need is multiple remote microphones all around your home, and Rhasspy tied to the Riva modules for both TTS and STT. Tutorials - Rhasspy

In this case, what would be the benefit of keeping Mycroft around? The plugins? The physical display hardware? I'm not trying to bash on Mycroft here; I was a supporter up until the project got cancelled. I'm simply trying to compare the state of the art in the relevant technologies so I can make a decision as to which one to choose (the best).

Here’s an example of a home assistant using a local LLM → Local „ChatGPT“ Chatbot talk with LLaMA/GPT4ALL + Coqui TTS 🤯 | Install Tutorial - YouTube

I’m a little confused why you were strapping Mycroft and Rhasspy together in the first place. We’re far more cousins than competitors - Rhasspy’s lead dev was MycroftAI’s last lead dev - but this is the first I’m hearing of a hybrid assistant.

Thank you for your response; however, it didn't answer any of my questions. If you were confused why I was "strapping Mycroft and Rhasspy" together, it's because I am trying to understand, differentiate, and distinguish, and I'm obviously looking for information in this regard.

Sorry. They didn’t come across like good faith questions. There’s a FAQ pinned.

Hey there! I’m not familiar with Riva, but I run both a Piper server and a Coqui server at home for my assistants. I also run FasterWhisper for STT. Offloading the heavy lifting does make the Mark 2 run quite a bit better, you’re correct, and one day I’d like to have most of my speakers function as voice satellites with the primary login happening someplace else.

I suppose it’d be possible to try to use the pieces of one system with another, but my understanding is that Rhasspy uses the Wyoming protocol, which is pretty unique to that system (and now Home Assistant). It does seem like everyone’s moving towards modular setups now, but you’d have to put an interface on them to make them speak to one another. Jarbas gave an example of how OVOS does that in an older reply.

I’ve got a mix of Neon and OVOS running at home and am experimenting with integrating a local LLM with those systems natively, similar to Thorsten’s video you shared. I imagine Home Assistant and Rhasspy will have something similar with slightly different protocols and interfaces. It’s a fun time for open source voice assistants!


As for benefits, STT/TTS are only part of the equation. You’d also need wake word software, intent parsing, some way to pass intents to specific code to execute against them, ways to manage your hardware (e.g. LEDs, screens, and so on), and all sorts of bits and bobs. If you don’t care about any of that and just want to rely solely on an LLM, then that would work fine. Otherwise, software like Rhasspy, OVOS, or Neon serves a purpose in providing a standard platform upon which to build your assistant’s capabilities.
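To make the "pass intents to specific code" step concrete, here's a toy keyword-based intent dispatcher (all names are made up for illustration; real systems like OVOS use much richer matchers). The transcript from STT goes in one end, and a handler function's reply comes out the other:

```python
def handle_weather(utterance):
    return "Checking the forecast..."

def handle_lights(utterance):
    return "Toggling the lights."

# Intent parsing, reduced to its simplest form: keywords -> handler.
INTENTS = {
    ("weather", "forecast"): handle_weather,
    ("light", "lamp"): handle_lights,
}

def dispatch(utterance):
    words = utterance.lower()
    for keywords, handler in INTENTS.items():
        if any(k in words for k in keywords):
            return handler(utterance)  # pass the intent to specific code
    return "Sorry, I didn't catch that."

print(dispatch("what's the weather like"))  # -> Checking the forecast...
print(dispatch("turn on the lamp"))         # -> Toggling the lights.
```

In a full assistant, wake word detection gates the microphone before STT, and the dispatcher's reply would be fed back through TTS; this sketch only shows the middle link of that chain.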


Fascinating, Mike - that was the sort of high-level overview I was looking for…

The idea behind running things on the AGX Nano is that it is specialized hardware, so maybe it would offer zippier performance in terms of training speech synthesis models, but also speedier TTS and STT. Not sure if I am on the right track here; just really looking for an excuse to buy one at this point, hah! :wink: Cheers

Sounds like a plan to me! I’m running my TTS and STT on a GPU to get zippier performance and better quality, so if there was a single package, I don’t see a downside. Looks like the implementation for OVOS is pretty simple so if you have questions, please feel free to reach out.