Description of a speech synthesis engine

The aim of this document is to provide an overview of what a speech synthesis
engine is and what its main components are. Hopefully this can lead to a discussion on where mimic should focus and improve. I bet we will need to improve speech reproduction (easy to mute, etc.) and work on the other areas to provide multilingual support in mimic.

This is a rough division of what a speech synthesis engine can be. It may not
be complete, but it still gives an overall view of the system. Not all the
modules mentioned below have to actually be in the flite/mimic code, but it would
be good to identify most of them to get familiar with the code.

Text parser

This block of the engine deals with the conversion from raw text to a series
of phonemes and additional contextual information of use to actually speak.

When we (humans) read, we all read the same text and extract similar information
about which phonemes we should utter, although each one of us has a different
voice (pitch, etc…).

  • Phrasing: Sentence detection. Splits a whole text into sentences that can be
    processed individually. Can provide some pausing information.

  • Tokenizer: Detects “tokens” in the text. As an example, spaces can be used
    to separate a sentence into individual tokens. Between two sentences the audio
    engine could be interrupted for 0.5 seconds without serious concerns, whereas
    an interruption between two consecutive tokens is annoying.

  • Token to words: Normalizes the text, for instance turns “1st” to “first”,
    or “Henry” “VIII” to “Henry” “the” “eighth”.

  • Part of speech tagger (POS tagger): Helps to distinguish homographs, that is,
    words that are spelled the same but may have different meanings and pronunciations. Not crucial, but helpful in some languages.

  • Words to phonemes: Provides a phonetic transcription of the words. This
    transcription may be based on a dictionary (Lexicon) and/or on a decision tree
    able to predict the pronunciation of unknown words based on how they are written.

  • Phoneme duration prediction: Not all the phonemes in all the contexts have
    the same length. Some speech synthesis models capture that information directly
    from recordings, whereas other speech synthesis models may expect more information
    from the text parser.

  • Intonation: A sentence has a prosody, not only in questions like “How do you do?”
    but also in “I like beer.” or “I like beer if it’s after work”. The prosody is
    the “melody” in the sentence. Some speech synthesis models capture that
    information, other models may require that information from the text.
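
As a rough illustration of the first few stages above (phrasing, tokenizing, token to words, words to phonemes), here is a minimal Python sketch. It is not how flite/mimic implement these steps. The tables `ORDINALS`, `ROMAN_ORDINALS` and `LEXICON`, the function names and the phoneme symbols are all made up for the example, and the rules are far too naive for real text.

```python
import re

# Hypothetical toy tables: a real engine ships far larger ones.
ORDINALS = {"1st": "first", "2nd": "second", "3rd": "third"}
ROMAN_ORDINALS = {"VIII": "eighth"}
LEXICON = {
    "henry": ["HH", "EH", "N", "R", "IY"],
    "the": ["DH", "AH"],
    "eighth": ["EY", "T", "TH"],
    "first": ["F", "ER", "S", "T"],
}


def split_sentences(text):
    """Phrasing: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Tokenizer: split on whitespace and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'") for tok in sentence.split())
    return [tok for tok in tokens if tok]


def tokens_to_words(tokens):
    """Token to words: expand ordinals and regnal numbers."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.lower() in ORDINALS:
            words.append(ORDINALS[tok.lower()])
        elif tok in ROMAN_ORDINALS and i > 0 and tokens[i - 1][0].isupper():
            # A Roman numeral right after a capitalized name: "Henry VIII".
            words.extend(["the", ROMAN_ORDINALS[tok]])
        else:
            words.append(tok.lower())
    return words


def words_to_phonemes(words):
    """Words to phonemes: lexicon lookup; None marks an unknown word."""
    return [(word, LEXICON.get(word)) for word in words]
```

Words that come back with `None` are the ones a real engine would hand to its letter-to-sound rules (the decision tree mentioned above) instead of the lexicon.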

Speech synthesis

The speech synthesis block deals with the generation of the sound waveform. The
traditional workflow for speech synthesis training consists of getting some
speech recordings of known, phonetically balanced sentences and using those
recordings to compose a new waveform.

There are many methods that differ in their fundamental approach to building waveforms
(concatenating segments of audio recordings vs creating a statistical model for
each contextualized phoneme vs others). The choice of speech synthesis method
has implications for:

  • The latency (computation required to generate speech)
  • The memory requirements
  • The disk requirements
  • The voice intelligibility (whether it is easy to understand)
  • The voice naturalness (whether it sounds like a real human)

This is a list of speech synthesis engines I have heard about or I have worked
with. I could try to find some examples for some of the engines or try to make a
chart of pros and cons of each of them if you are interested.

Based on concatenation

  • Diphone based speech synthesis: Phonemes are not always pronounced the same way;
    they depend on their neighbours. In a diphone speech synthesis engine, we try
    to capture all possible pairs of phonemes from recordings and we concatenate
    those pairs to create new waveforms. If the diphone we want is not in our
    recordings, we use a similar one based on phoneme features.
    *Sounds a bit like a robot* (read with robot voice)
  • clunits: Based on the cluster unit selection algorithm (Black 1997).
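
To make the diphone idea more concrete, here is a small Python sketch of unit selection with a feature-based fallback. Everything in it is an assumption for illustration: the `FEATURES` table, the tiny `INVENTORY` of fake samples, the scoring in `phoneme_similarity` and the function names are invented, and a real engine would also smooth the joins between units, which this sketch ignores.

```python
# Hypothetical phoneme features used to find a "similar" diphone.
FEATURES = {
    "p": ("stop", "unvoiced"),
    "b": ("stop", "voiced"),
    "t": ("stop", "unvoiced"),
    "a": ("vowel", "voiced"),
}

# Hypothetical inventory: diphone -> recorded samples (tiny fake data).
# "_" stands for silence at the start or end of an utterance.
INVENTORY = {
    ("_", "p"): [0, 1],
    ("p", "a"): [2, 3],
    ("a", "t"): [4, 5],
    ("t", "_"): [6, 0],
}


def to_diphones(phonemes):
    """Pad with silence and pair each phoneme with its successor."""
    padded = ["_"] + list(phonemes) + ["_"]
    return list(zip(padded, padded[1:]))


def phoneme_similarity(a, b):
    """Score two phonemes: exact match beats any partial feature overlap."""
    if a == b:
        return 3
    fa, fb = FEATURES.get(a, ()), FEATURES.get(b, ())
    return sum(1 for x, y in zip(fa, fb) if x == y)


def select_unit(diphone):
    """Return the recorded diphone to use, falling back to the most similar."""
    if diphone in INVENTORY:
        return diphone
    return max(
        INVENTORY,
        key=lambda cand: phoneme_similarity(cand[0], diphone[0])
        + phoneme_similarity(cand[1], diphone[1]),
    )


def synthesize(phonemes):
    """Concatenate the samples of the selected diphones."""
    samples = []
    for diphone in to_diphones(phonemes):
        samples.extend(INVENTORY[select_unit(diphone)])
    return samples
```

With this toy inventory, asking for "b a t" reuses the recordings of "p a t", because "b" and "p" share the "stop" feature.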

Based on statistical models

  • HTS: Hidden Markov Model (HMM) based speech synthesis. Instead of concatenating
    recorded segments, we generate a model of how the recording sounds for each
    contextualized phoneme. A well-known FOSS engine is hts_engine, and there is
    a version of flite+hts_engine available.
  • clustergen: Statistical parametric synthesis engine.

Speech reproduction

This block deals with the steps required to go from a waveform to the actual
sound that comes through the speakers. It can be fairly simple:

  • Open audio device
  • Write to audio device
  • Wait
  • Close

But it can get complicated:

  • Latency
  • Asynchronous output
  • Pause / Resume
  • Pause / Say another thing
  • Repeat the last sentence (“Mycroft, can you repeat please?”)
  • Volume control
  • Integration with other audio systems (recording/speaking at the same time…)
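
Several of the points above (latency, asynchronous output, interrupting to say another thing) come down to not writing the whole waveform in one blocking call. Here is a minimal Python sketch of that idea, not mimic's actual audio code. `ChunkedPlayer` and its methods are hypothetical names, and `write_chunk` is a stand-in for whatever call really writes samples to the audio device.

```python
import threading


class ChunkedPlayer:
    """Plays samples in small chunks so playback can be stopped quickly."""

    def __init__(self, write_chunk, chunk_size=4):
        self._write_chunk = write_chunk  # hypothetical stand-in for a device write
        self._chunk_size = chunk_size
        self._stop = threading.Event()
        self._thread = None

    def play_async(self, samples):
        """Cut off any current speech, then start playing in the background."""
        self.stop()
        self._stop.clear()
        self._thread = threading.Thread(
            target=self._run, args=(list(samples),), daemon=True
        )
        self._thread.start()

    def _run(self, samples):
        for start in range(0, len(samples), self._chunk_size):
            if self._stop.is_set():
                return  # interrupted between two chunks
            self._write_chunk(samples[start:start + self._chunk_size])

    def request_stop(self):
        """Ask the worker to stop after the chunk it is currently writing."""
        self._stop.set()

    def wait(self):
        """Block until the current playback has finished or been stopped."""
        if self._thread is not None:
            self._thread.join()
            self._thread = None

    def stop(self):
        """Request a stop and wait until the worker has actually stopped."""
        self.request_stop()
        self.wait()
```

The chunk size is the trade-off here: the worker only checks for a stop request between chunks, so smaller chunks react faster to "mute" at the cost of more calls to the device.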

Thanks, interesting post.

I have a question about the speech analyzer. It will be in the cloud, right? On your servers?
Is the sound file streamed to your server or sent after recording?

I think SoundHound’s STT engine works by streaming the sound file, so it starts analyzing as soon as you talk and is therefore really, really fast.

Sorry if I’m off topic (?)

I’m sorry, I am just helping in my spare time with the speech synthesis (converting text to speech). You seem to be asking about speech recognition (speech to text). I don’t know the details of Mycroft’s speech recognition system.

Oh yeah sorry, I’m off topic here.
Maybe I should start a new thread (but I have created too many already :D)
Hound is very impressive, anyway…

If only Hound were open source :frowning:

Still, it’s very impressive.

@zeehio - Thanks for taking the time to write this up.

Joshua Montgomery