There is a complete lack of Linux voiceAI interoperability

StuartIanNaylor · August 31, 2020, 4:26am

It’s fubar to have proprietary audio satellites locked into certain systems.
You have your own mechanisms, and they have their purpose, but if you boil it down, a mic & speaker is basically all the voiceAI common denominator needs to be.

Mycroft as the ASR server is all good for me, and so is Rhasspy and the many others that are available, but I should be able to use the base functionality of a satellite with any ‘Open Source’ VoiceAi server. Having to put ‘Open Source’ in quotes because of dubious choices about interoperability is such a shame.
It’s just as much a shame that in reality Mycroft, Rhasspy and others have zero interoperable functionality, and the common hardware of simple satellites cannot be reused without complete firmware changes.

The base function of a satellite is audio only, as without audio a voiceAI doesn’t work. Everything else is just an addition that should be attributed to that device but not be exclusive to it.
A pixel ring on a device is a pixel ring device, a screen is very much the same and so is any further satellite hardware.

I have created a repo that is currently bare, so that it’s not tied to Rhasspy, Mycroft or others, and it has no intention other than being a simple interoperable satellite platform.

Satellites should be like keyboards & mice: we just plug them in. If anyone wants to add input then please do.
My fave for audio RTP is snapcast, and I am wondering if anyone has any Airplay experience, or anything else, to bring into the initial scope. Gnomecast, https://roc-streaming.org/ and others … ?

It’s extremely easy to set up an array of servers and clients that play into a loopback sink and so present a source, and some small containers would cover it.

Avahi, audio-rtp & docker, and I think you are getting the gist…
It’s one satellite to work with all, not to rule them all, so the KWS and whatever else is purely a choice. Is there any interest? It’s very possible to create with some simple tools that already exist.
I’m thinking of maybe kickstarting it with Precise or Raven, and seeing if you guys might be open to allowing the Rhasspy functionality, which is little more than Alsa hooks.

Going to do the same @ Mycroft and see if anything is occurring

Dominik · August 31, 2020, 6:45am

As you seem to be familiar with Rhasspy, you most likely came across Hermes-Audio-Server and its successors Rhasspy-Speakers and Rhasspy-Microphone, which allow audio transport over MQTT.

Shouldn’t be overly complicated to integrate this into Mycroft Core/Audio-service…

There is also a variant of Hermes-Server that runs on Matrix-Voice-ESP32 (I played with it but never got it running, though this is how I actually became aware of Hermes…)

StuartIanNaylor · August 31, 2020, 11:53am

To be honest, it’s Hermes-Audio that has prompted my call for audio-rtp, as audio transport over MQTT is just a strange concept to start with.

No, please don’t ignore the many audio protocols that provide the QoS, latency and function that streaming audio needs. They are designed specifically for audio, unlike a lightweight bi-directional message queue, which in terms of streaming audio is probably the worst protocol you can use.

I quite like Rhasspy post-KWS; in fact I used to like Rhasspy, and where it’s going is their decision, not mine. But in terms of an interoperable satellite that uses standard methods, you have just killed it in one fell swoop by adding a proprietary protocol such as Hermes-audio.

The Matrix-Voice-ESP32 sums up very much what I think of Hermes audio, as it’s a $75 sound card/pixel ring that is a WTF in terms of required functionality and apparently doesn’t work.
So it’s definitely a no to integrating the absolutely ridiculous idea that voiceAI is special in terms of streaming audio and that a project should design its own protocol just for the sake of it.

Mycroft Core uses standard Alsa/Pulse audio. Both suck for reliable network streaming but work perfectly as the default Linux audio systems, and that is exactly my point: there should be no need for integration into Mycroft Core, as standard Alsa/Pulse audio should be used.

Also, for audio-rtp it would seem wise to use an audio-rtp protocol, not a bidirectional messaging and monitoring protocol. That is why I posted: I was interested in which audio-rtp protocols are out there and was wondering what others would throw up as a choice.
There are already many wireless speaker systems out there, and Hermes-audio will play on none of them. It is a totally non-interoperable proprietary protocol with ridiculously low adoption, as without Snips it’s now only used in Rhasspy!

StuartIanNaylor · August 31, 2020, 12:47pm

To get back on track with interoperable Linux standards: Alsa (Advanced Linux Sound Architecture) can, as far as I am aware, have any number of devices. It just depends on the kernel settings, and if Alsa is a module it can also be configured as a modprobe setting.

But when you start to think about what is fit for purpose on an individual voice server, even the commonly configured max of 32 covers many.
I am not exactly sure, as I have never scaled up to this amount, but an audio server is usually a one-way stream of multiple channels. On the theme of MatrixModules, aloop can support 8 devices with 8 sub-devices, where clients can present a source by playing into an aloop sink.
That is the max number of devices before we get to PCM streams.

That allows all audio to be externalised from the Mycroft core and ensures interoperability, as mics & rooms can be exposed as standard Alsa sinks/sources.
There is no need for a proprietary protocol over MQTT. Down the line, larger systems could use it to create server clusters, but at this level we have absolutely no need.

I was wondering what audio/streaming curveballs would be thrown in terms of interoperability: do the likes of Spotify, Sonos or even Chromecast come into play? I also wanted to gather advocates, as they can give details of their preferred protocol requirements.
Interoperability is about bringing common format(s) to Mycroft so that it can co-exist with multiple systems without being forced into singular operation with uncommon protocols.
Hermes-audio can be one of those interoperable systems working via standard Linux architecture, but it definitely should not be integrated into the core or need to be.

Guyverix · September 8, 2020, 3:42pm

I have not really dug into it much yet, but are you asking about something like this for RTP multicast streaming: http://www.pogo.org.uk/~mark/trx/

It kinda looks like it would fit the parameters of what you are attempting to do…

StuartIanNaylor · September 8, 2020, 3:58pm

That is one out of many. What I am suggesting is that the common Linux input patchbay should be the ALSA loopback (snd-aloop) on an ASR server.
Then http://www.pogo.org.uk/~mark/trx/ , airplay, snapcast, roc, pulseaudio-rtp, gnomecast or whatever… can be installed and play into the loopback, so that the corresponding sub-device on the loopback is presented as a standard ALSA source.
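As a rough illustration of the patchbay idea, here is a minimal Python sketch that maps a satellite number to a pair of ALSA loopback device names. It assumes the stock snd-aloop layout, where the card is named "Loopback" and audio played into device 0, sub-device N can be captured from device 1, sub-device N, with 8 sub-devices by default. The names `LoopbackPair` and `loopback_pair` are made up for this example and are not part of Mycroft, Rhasspy or ALSA.

```python
from dataclasses import dataclass

# Assumption: snd-aloop loaded with its default of 8 sub-devices per device.
SUBDEVICES_PER_DEVICE = 8


@dataclass(frozen=True)
class LoopbackPair:
    """Hypothetical helper type: the two ends of one loopback sub-device."""
    playback: str  # the network audio receiver plays into this end
    capture: str   # the ASR server opens this end as an ordinary ALSA source


def loopback_pair(satellite_index: int, card: str = "Loopback") -> LoopbackPair:
    """Return the ALSA device names for one satellite's loopback slot."""
    if not 0 <= satellite_index < SUBDEVICES_PER_DEVICE:
        raise ValueError(
            f"satellite_index must be 0..{SUBDEVICES_PER_DEVICE - 1}"
        )
    return LoopbackPair(
        playback=f"hw:{card},0,{satellite_index}",
        capture=f"hw:{card},1,{satellite_index}",
    )


if __name__ == "__main__":
    pair = loopback_pair(2)
    print("receiver plays to:", pair.playback)
    print("ASR server records from:", pair.capture)
```

Whatever receiver is used (trx, snapcast, roc and so on) would be pointed at the playback name, and the ASR server would be configured with the capture name as its mic.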

j1nx · September 8, 2020, 4:26pm

Not 100% sure if I understand what you want, but isn’t pulseaudio able to do what you would want?

Loop device sink and you can add whatever playback you want.

You can then pass on the loopback sink back into alsa userland.

But again, not sure if this is what you mean.

StuartIanNaylor · September 8, 2020, 5:26pm

Yeah, exactly what I mean. With a single satellite it would work now, as you just select the right Alsa loopback sub-device.

We don’t have a method for multiple mic/speaker satellites, and as above the majority can be done with standard Linux methods.
But exactly how, and the way many of them work, actually breaks what is a simple function.
A KWS would work, as RTP should only broadcast from the KW trigger to VAD silence, so it’s not constantly broadcasting, and only triggered mics offer RTP.
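The KW-trigger-to-VAD-silence gating can be sketched as a small state machine. This is only an illustration under stated assumptions: the frame format (PCM bytes, a KWS score and a VAD speech flag), the threshold and the silence count are all hypothetical, and a real satellite would get those values from whichever KWS and VAD it runs.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical frame format: (pcm_bytes, kws_score, vad_says_speech)
Frame = Tuple[bytes, float, bool]


def gate_frames(
    frames: Iterable[Frame],
    kws_threshold: float = 0.5,
    silence_frames: int = 3,
) -> Iterator[bytes]:
    """Yield only the audio that should be sent over the network.

    Streaming starts on the frame where the keyword score crosses the
    threshold and stops after `silence_frames` consecutive non-speech
    frames, so an untriggered mic sends nothing.
    """
    streaming = False
    quiet = 0
    for pcm, kws_score, is_speech in frames:
        if not streaming:
            if kws_score >= kws_threshold:
                streaming, quiet = True, 0
                yield pcm  # the trigger frame itself is sent
            continue
        yield pcm
        quiet = 0 if is_speech else quiet + 1
        if quiet >= silence_frames:
            streaming = False  # VAD silence reached, stop sending


if __name__ == "__main__":
    demo = [
        (b"a", 0.1, False),  # idle, not sent
        (b"b", 0.9, True),   # keyword detected, streaming starts
        (b"c", 0.0, True),
        (b"d", 0.0, False),
        (b"e", 0.0, False),  # second silent frame, streaming stops
        (b"f", 0.0, True),   # idle again, not sent
    ]
    print(list(gate_frames(demo, silence_frames=2)))
```

The yielded frames are what would be handed to the RTP sender of choice; everything else stays on the satellite.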

There is more to think about than just that, but actually not all that much. Having interoperable satellites that give a choice of KWS system (satellite) and of ASR server (Mycroft ASR>>>server, Linto or Rhasspy…) could be beneficial to all and provide choice via mix and match.

The only thing really missing is the KWS trigger value, as the satellite with the highest value on the server should be the one that is used. Instantly you have created a widely distributed mic array, without what seems in many projects to be a huge amount of unnecessary bloat to create something as simple as a KWS mic/speaker satellite.
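A minimal sketch of that server-side choice, assuming each satellite reports its KWS score along with a timestamp when it triggers. The `Trigger` type, the `pick_satellite` function and the half-second window are hypothetical choices for this example, not something any of the projects mentioned define.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Trigger:
    """Hypothetical report a satellite sends when its KWS fires."""
    satellite: str
    score: float      # KWS confidence, higher is better
    timestamp: float  # seconds, on a clock shared by the satellites


def pick_satellite(
    triggers: Iterable[Trigger], window: float = 0.5
) -> Optional[Trigger]:
    """Pick the best-scoring trigger among those for the same utterance.

    Triggers arriving within `window` seconds of the earliest one are
    treated as the same keyword heard by several mics; the highest
    score wins and only that satellite's audio is used.
    """
    ordered = sorted(triggers, key=lambda t: t.timestamp)
    if not ordered:
        return None
    first = ordered[0].timestamp
    candidates = [t for t in ordered if t.timestamp - first <= window]
    return max(candidates, key=lambda t: t.score)


if __name__ == "__main__":
    heard = [
        Trigger("kitchen", 0.62, 10.00),
        Trigger("lounge", 0.91, 10.12),
        Trigger("hall", 0.99, 12.00),  # too late, a different utterance
    ]
    print(pick_satellite(heard))
```

The winning satellite's loopback source would then be the one the ASR server listens to for that utterance.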

Here, with a similar project like Rhasspy, they use the Hermes protocol, which is a ton of unnecessary proprietary bloat, but really satellites should become common devices that work plug and play with any VoiceAI server.