Free German TTS voice for Mycroft (sneak preview)


We (some nice guys from the Mycroft community and me) are currently working on a free-to-use German TTS voice based on my personal voice dataset contribution.

The model is based on Tacotron 2 combined with a PWGAN (Parallel WaveGAN) vocoder. It can be run locally, without any cloud connection. We're trying hard (trial and error) to provide a free model with acceptable quality for daily usage in the future, but we still have some work to do.

Nevertheless, we wanted to share some sample audio as a "sneak preview" of what is currently possible.

Also available on SoundCloud.

Thanks @Dominik, @baconator, Repodiac, @Olaf :slight_smile:

For more information on the dataset, feel free to look at my GitHub page:


I've followed your efforts over the last weeks; I'm curious what hardware (CPU/GPU) should be used for a flawless experience.

Currently you would need a GPU to produce speech in real time, i.e. 2 seconds of processing for 2 seconds of audio; on a regular Pi it's maybe 8–10 seconds. So it's still not what we want, but up until now we didn't have a free model at all. A small step for us and a small step for Mycroft :slight_smile: @Dominik can give more info, because he has already tried it.


I am running my tests on a Xavier AGX. (A direct comparison with a graphics card is difficult, but a GTX 10x0 with 8 GB should give similar or even better results.)

In the best case I see a real-time factor of 0.3 (1 second of audio requires 0.3 seconds of processing). As the model still has some problems with "stop attention", this can go up to 5.0. Interestingly, this happens with shorter phrases.
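For readers unfamiliar with the metric: the real-time factor (RTF) is simply processing time divided by the duration of the produced audio, so values below 1.0 mean faster than real time. A tiny sketch using the numbers from this post (the helper function is my own, not part of any library):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / produced audio duration.
    RTF < 1.0 means synthesis runs faster than real time."""
    return processing_seconds / audio_seconds

# Best case on the Xavier AGX: 0.3 s of processing per 1 s of audio.
print(real_time_factor(0.3, 1.0))  # 0.3 -> faster than real time
# Worst case when "stop attention" misbehaves:
print(real_time_factor(5.0, 1.0))  # 5.0 -> five times slower than real time
```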

But with some tricks, like caching the synthesized audio files, you will get a better experience.
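A minimal sketch of that caching idea, assuming a hypothetical `synthesize()` function standing in for the real Tacotron 2 + PWGAN pipeline (the cache layout and naming here are mine, not from any project):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("tts_cache")
CACHE_DIR.mkdir(exist_ok=True)

def synthesize(text: str) -> bytes:
    # Placeholder for the real (slow) Tacotron 2 + PWGAN synthesis.
    return f"WAV:{text}".encode("utf-8")

def cached_tts(text: str) -> bytes:
    # Key the cache on a hash of the phrase, so frequent prompts
    # ("Wie spät ist es?") are only synthesized once.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    wav_path = CACHE_DIR / f"{key}.wav"
    if wav_path.exists():
        return wav_path.read_bytes()  # cache hit: no model inference at all
    audio = synthesize(text)          # cache miss: run the model once
    wav_path.write_bytes(audio)
    return audio
```

On a slow device this turns the worst-case RTF into a one-time cost per phrase.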


Still uncharted territory for me: you can convert the PyTorch model to TensorFlow Lite. This may result in better performance on an RPi…

Just to say: cool, guys! Keep it up.


The main takeaway for me was that the data can be used to produce a reasonably good model. In the beginning it didn't work and we didn't know why. Now we know that we can use Thorsten's data and can try different configs or combinations. @Dominik, thanks for the numbers.


I'm a little concerned that a top-of-the-class board with 32 TOPS peaks at an RTF of 5, but maybe that's a configuration problem.

I wonder how the Coral dev board, or rather the broken-out coprocessor (USB Accelerator), would perform. I don't like the idea of letting my Windows PC do the heavy lifting, since that would mean the PC has to be powered 24/7.

Is this benchmarked using Mycroft?

In addition to what @Olaf already said, you might want to take a look at this thread.


The current model needs some fine-tuning for shorter phrases (up to 6–7 words); longer sentences already work better.

The Xavier AGX has a very good "TOPS per watt" value. Even in "max power mode" it idles around 5 W and peaks at 30–40 W.

For Tacotron, a GPU would be ideal. I use NVIDIA 1030s; they don't draw much when idle, and fanless models are available. Yes, this means running a host with them in it 24/7, but for quality and speed you're going to have to make some trade-offs.

We're quickly approaching a point where a CPU can be used instead of a GPU, so this answer may change in the next year.


Very recently an article popped up on how to set up a Windows (easily reproducible on Linux) DeepSpeech server for Mycroft.

Yet the Mozilla-trained models seem a little different, using a .pbmm file and a separate scorer.

Like the exclusion :grin:

I've got a stripped-naked 1080 (only 2 GB dedicated, though).

I already thought that'd catch someone's eye :grin:
I'm trying to be "nice", though, but whether I'm succeeding is for "the other guys" to say. :wink:

I can gladly confirm that Thorsten is a nice guy, too. :smile:


As it turns out, this is an mmap-able format for inference. The pretty easy process of converting .pb to .pbmm is described here.

This would be for deepspeech, not TTS.

The DeepSpeech server is serving STT/TTS. I just don't think it will run another model type.

DeepSpeech just does STT. Tacotron is TTS.

This thread is about the latter.

Oh OK, now that I've dug a little deeper I see the article talks about two servers, with the second model already packaged, so I hadn't recognized it as such.

So, STT aside: is the TTS serving (described in the how-to) still viable? Or what would you suggest?

Is an STT model trained on data from a single speaker beneficial?

@Thorsten @Dominik Have you planned to upload the model?

If you're referring to TTS/TTS/server at master · mozilla/TTS · GitHub, then yes, this is still viable, and I sent off a package earlier today using exactly that.
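For anyone trying that server: synthesis is a single HTTP GET returning a WAV file. A sketch assuming the server's default port 5002 and its `/api/tts` endpoint with a `text` query parameter (check your own server config, these may differ):

```python
from urllib.parse import urlencode
from urllib.request import urlopen  # only needed for the actual request


def tts_url(text: str, host: str = "localhost", port: int = 5002) -> str:
    # Build the request URL; urlencode handles spaces and punctuation.
    return f"http://{host}:{port}/api/tts?{urlencode({'text': text})}"


# With a running server, fetching the audio is one call:
# wav_bytes = urlopen(tts_url("Guten Morgen!")).read()
# open("guten_morgen.wav", "wb").write(wav_bytes)
print(tts_url("Guten Morgen!"))
```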

Possibly, for that one person. A wider set of submitted data would almost certainly help, even if the bulk of it came from one speaker.