Mycroft Technology Roadmap

# From Here to There: 2016 Mycroft Roadmap

One of the great things about what we are doing together as the Mycroft community is the breadth of the problem we are tackling and the huge impact it can have on the world. This is incredibly exciting, and daunting at the same time. But as a community we can break this down into smaller core technologies and tasks that improve the whole as each of the parts iteratively improves.

The Mycroft of today is built upon five Core Technologies:

  • Wake Word
  • Text to Speech (Mimic TTS)
  • Intent Parsing (Adapt)
  • Speech to Text (OpenSTT)
  • Framework (aka mycroft-core)

The first four Core Technologies have intrinsic value as stand-alone tools, enabling other efforts outside of Mycroft. The last pulls the rest together to build something even greater, enabling the AI Assistant we are all working towards.

In addition to those Core Technologies, the Mycroft experience is strengthened by additional pieces we are developing which make the whole valuable to everyone. Those are:

  • Mycroft Backend (API key management, deep learning dataset management, device management)
  • Skills Ecosystem
  • Individual Skills
  • “Enclosures” – Mycroft Mark 1, Linux desktops, Android, cloud/web, …

Finally, there is research that will be valuable as the above technologies mature. This is incredibly vast, but I think several specific areas are worth noting as they will rapidly become invaluable:

  • Machine learning frameworks
  • Emotional understanding

Sections below will go through each of these in more detail discussing how they each can improve. As a community effort each can and will improve at different rates, but the whole will benefit regardless of the order of the improvements.

State of the Art

Today’s Mycroft is a strong technical preview, but certainly not the AI assistant promised by sci-fi masters like Heinlein, Roddenberry or Scalzi. Mycroft is “fun and interesting” and will soon hit the milestone of “very useful”; both important milestones on the path to becoming a trusted and pervasive AI for Everyone.

The maturity of the core technologies varies, but the pieces being developed for Mycroft are in demand and already proving useful outside of the Mycroft system. Sonar GNU Linux, for example, is incorporating Mimic TTS into their Linux distribution for the visually impaired.

Mycroft, Inc. – a subset of the Mycroft community

Although not part of the technical roadmap, it is important to understand the distinction between Mycroft AI, Inc. and the Mycroft community. These two are closely related and often intermingled, but distinctly different.

As we began developing Mycroft, we quickly realized that what we wanted to build was too important to remain in the hands of one private entity. We drew a line between what we had created that was core technology and what we needed to build to support our interactive assistant vision. The core technology was then made Open Source, which allows others to leverage it in ways we can’t even anticipate and to contribute back to help everyone.

Mycroft AI, Inc. remained necessary for several reasons. One was practical: some entity needed to sign contracts and pay for the servers and services required for the unified Mycroft ecosystem we imagined. Another was to create a dedicated core team that could focus on building Mycroft technology without being distracted by a competing “day job”. And just as important as building the technology was the need to let the world know what has been built, for many who can benefit from it are not typical Open Source users.

Work Breakdown

In the subsequent sections you will see many bullet points of work for each individual technology. Each of these bullet points can be tackled by an individual or small team. Some of these will be performed by Mycroft AI, Inc., but we expect many will be tackled by individuals with unique talents and interest in the particular area. In those cases, Mycroft AI, Inc. will just help coordinate and support that work.

Over the next several weeks and months we will flesh out frameworks to help organize the community and achieve these goals. Of course, individuals and sub-teams are welcome to use whatever tools they are most comfortable with for organization and development, but having the tools already identified and built lets people focus on the real creative and important work.

Individual Technologies and near-term goals

Wake Word

The current system for waking up Mycroft is good. But as the kick-off point for all voice interactions, it needs to be great. Work in these areas will all help achieve that:

  • Improve false positive/false negative triggers
  • Easy wake word customization
  • Voice registration / voice printing for user identification
  • Extend for cloud-less commands (“Stop”, “Pause”, etc.)
  • Come up with a catchy name!
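One way to picture the "false positive/false negative" bullet: instead of demanding an exact transcript match, score the hypothesis against the configured wake word and trigger within a tolerance. This is only an illustrative sketch, not the actual Mycroft wake word implementation; the threshold value and function names are made up for the example.

```python
# Minimal sketch: score an STT hypothesis against a configurable wake word
# using Levenshtein edit distance, so small mis-hearings (e.g. a dropped
# final consonant) still trigger while unrelated phrases do not.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def is_wake_word(hypothesis: str, wake_word: str = "hey mycroft",
                 max_relative_distance: float = 0.25) -> bool:
    """Trigger if the hypothesis is 'close enough' to the wake word."""
    dist = edit_distance(hypothesis.lower().strip(), wake_word)
    return dist <= max_relative_distance * len(wake_word)
```

Making the wake word a parameter rather than a constant is what the "easy wake word customization" bullet implies: users change one configuration value rather than retraining a model.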

Speech to Text (OpenSTT)

Of all the Mycroft technologies, OpenSTT is still in the earliest stages of development. Initial testing has been performed on top of the Kaldi ASR engine with high-quality results in English. Further validation needs to occur here, however, to achieve the type of high accuracy, low latency results users have come to expect from proprietary technologies like Siri or Google Assistant.
Other pieces needed to build a strong STT engine:

  • Training data collection and preparation
  • Multi-language support
  • Language/Accent detection
  • Mixed language support

Intent Parsing (Adapt)

Accurately determining user intent from conversational interaction is critical to providing Mycroft users with a high quality user experience. The current Adapt engine uses a rules-based approach which is an excellent solution for many applications, but just the beginning. Improvements to come include:

  • Implementing deep learning to run in parallel with known entity approach
  • Interface for training deep learning intents
  • Disambiguation of training sets
  • Designing conversational interaction to refine intents
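To make the "rules-based, known entity" description concrete, here is a toy matcher that registers required keywords per intent and returns candidates with confidence scores. To be clear, this is not Adapt's actual API; it only illustrates the shape of the approach, and the confidence metric (fraction of required keywords present) is an assumption for the example.

```python
# Toy rules-based intent determination: intents declare required keywords,
# and each match is reported with a confidence score.

def register_intent(registry: dict, name: str, required_keywords: list) -> None:
    registry[name] = set(required_keywords)

def determine_intent(registry: dict, utterance: str) -> list:
    """Return (intent_name, confidence) pairs, best match first."""
    words = set(utterance.lower().split())
    results = []
    for name, keywords in registry.items():
        hits = sum(1 for kw in keywords if kw in words)
        if hits:
            results.append((name, hits / len(keywords)))
    return sorted(results, key=lambda r: -r[1])

registry = {}
register_intent(registry, "WeatherIntent", ["weather"])
register_intent(registry, "TimerIntent", ["timer", "set"])
```

Returning scored candidates rather than a single winner is what makes it possible to later run a deep-learning parser in parallel, as the first bullet suggests, and merge both result lists.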

Text to Speech (Mimic TTS)

Mimic is already good, but still not indistinguishable from a human. Yet.

  • Build more voices in English with our partner VocaliD
  • Support for global languages
  • Enhance expressiveness (SSML), prosody, cadence and tone
  • Performance: Phrase caching
  • Performance: Pre-loaded Mimic (PyMimic)

Framework (aka mycroft-core)

  • Enhance Skills API (much more on this below)
  • Mechanism to run most of core in cloud for limited power systems
  • Performance: Pre-process based on STT hypothesis
  • Securing the framework: data isolation and preventing malicious activity

Mycroft backend

  • Skill submission/deployment system (Skill Store)
  • Account/device/skill management and customization
  • Generalized OAuth mechanism for Skills
  • eCommerce support
  • Access to alternative STT/TTS engines
  • Self service API access for organizations/developers


Enclosures

The Enclosure is the embodiment of a Mycroft – the portal that lets you access your personal Mycroft. Each of these embodiments has unique capabilities – the knob on top of a Mycroft Mark 1, the screen of an Ubuntu desktop, the GPS and accelerometers of an Android phone. Enclosures should hide these differences from most Skills that don’t require special hardware, while allowing other Skills to exploit those unique capabilities.

Currently mycroft-core includes some assumptions about the enclosure. Efforts to support Android and Ubuntu have required changes to the core; these efforts need to be unified.

  • Virtualize the concept of Enclosure
  • Ubuntu/Fedora Desktop enclosure
  • Android App enclosure
  • iOS enclosure
  • Web enclosure
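One plausible shape for "virtualize the concept of Enclosure" is an abstract base class with capability flags: most Skills use only the abstract interface, while hardware-aware Skills can query for special features. The class and method names below are hypothetical, not mycroft-core's actual enclosure interface.

```python
# Sketch: an abstract Enclosure that hides hardware differences.
from abc import ABC, abstractmethod

class Enclosure(ABC):
    capabilities: set = set()

    def has_capability(self, name: str) -> bool:
        return name in self.capabilities

    @abstractmethod
    def display(self, text: str) -> None:
        """Show text however this enclosure can (LEDs, screen, log...)."""

class DesktopEnclosure(Enclosure):
    capabilities = {"screen", "keyboard"}

    def display(self, text: str) -> None:
        print(text)

class Mark1Enclosure(Enclosure):
    capabilities = {"led_matrix", "knob"}

    def display(self, text: str) -> None:
        print(f"[scrolling on LED matrix] {text}")
```

A Skill that needs GPS, say, would call `has_capability("gps")` and degrade gracefully when the answer is no, which is exactly the "required Enclosure capabilities" idea in the Backend Integration list later.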

Skills Ecosystem

The key to the Mycroft system is the Skills, as its value will grow with every Skill added. Ultimately these Skills will be built by people who have yet to hear about Mycroft, so the process needs to be as easy as possible while still giving them the power to build anything they can imagine.

  • Skill API (see entire section below)
  • More well documented examples
  • Tools to help Skill creation (forums, Skill Ideation Hub)
  • Automated testing/validation systems

Individual Skills

This section could be pages long. Here are some of the basics to get things started.

  • Google Calendar
  • TED
  • Spotify
  • Mopidy
  • Wink

Research Projects

These pieces aren’t strictly needed for the first stages of Mycroft. But work in these areas can quickly be applied to enhance the framework.

  • Machine learning framework
  • Emotional understanding (analyzing inflection, words) as context
  • Integrating other input (camera, biometrics, networks) for context

Skills API

The importance of Skills to Mycroft cannot be overstated. The utility of the system is dictated by both the number and quality of Skills, and the rate at which those are created is dictated by the elegance and power of the tools provided to Skill authors.


  • Skill manager to negotiate intent keyword overlap (disambiguation)
  • Break intent handling into two stages: parsing with confidence levels; and performing action
  • Support multiple STT interpretations with probabilities
  • Groups and intra-group communication
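The "two stages" bullet can be sketched as a protocol where every skill first reports a confidence that it can handle the utterance, and only the winner performs the action. The skill classes, threshold, and scores below are illustrative assumptions, not the real Skills API.

```python
# Sketch of two-stage intent handling: parse with confidence, then act.

class TimerSkill:
    def can_handle(self, utterance: str) -> float:
        return 0.9 if "timer" in utterance else 0.0

    def handle(self, utterance: str) -> str:
        return "timer set"

class WeatherSkill:
    def can_handle(self, utterance: str) -> float:
        return 0.8 if "weather" in utterance else 0.0

    def handle(self, utterance: str) -> str:
        return "it is sunny"

def dispatch(skills: list, utterance: str):
    """Ask every skill for a confidence; only the best one acts."""
    best = max(skills, key=lambda s: s.can_handle(utterance))
    if best.can_handle(utterance) > 0.5:   # illustrative threshold
        return best.handle(utterance)
    return None   # no skill is confident; fall back / ask for clarification
```

Separating the stages is what enables the "keyword overlap" bullet above: when two skills both claim an utterance, the manager can compare confidences or ask the user to disambiguate before any action runs.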


  • Provide access to context (user, location, history, environment)
  • Bluetooth user recognition
  • “Converse” stage for active Skills before normal intent parsing
  • Conversation scripting/branching

General API Tools

  • Alternative keyword definition (flexible, simpler, non-Regex)
  • Easy state save/load
  • Parsing tools: Date/time extraction, Location extraction, Number extraction
  • Formatting tools: ‘Nice’ time/date formatting, spoken numbers
  • Easy HTTP GET (with caching), POST and DELETE
  • Built-in JSON and XML parsing tools
  • Generalized monitoring of GPIOs (callbacks on state change)
  • Callbacks for non-voice events (time-based, user arrival, etc)
  • Generalized time events
  • Enclosure capabilities exploration/access
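As a taste of the "Number extraction" parsing tool above, here is a deliberately tiny sketch that turns simple spoken English numbers into integers. Coverage is minimal by design; a real tool would handle ordinals, fractions, "hundred"/"thousand" scaling, and other languages.

```python
# Sketch: extract the first spoken number ("twenty one" -> 21) from a phrase.

UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
         "eleven": 11, "twelve": 12}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def extract_number(phrase: str):
    """Return the first number found in the phrase, or None."""
    total, found = 0, False
    for word in phrase.lower().replace("-", " ").split():
        if word in TENS:
            total += TENS[word]
            found = True
        elif word in UNITS:
            total += UNITS[word]
            found = True
        elif found:
            break   # stop at the first non-number word after a match
    return total if found else None
```

The date/time and location extractors in the same bullet would follow the same pattern: a plain function a Skill author can call without knowing anything about the parsing internals.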

Backend Integration

  • Metadata description for skill option editing
  • Required Enclosure capabilities system

Conversation Mode

  • Constant listening for a few seconds after being woken up or after any interaction
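That single bullet hides a small state machine: after any interaction, utterances are accepted without the wake word until a window expires. A minimal sketch, with the clock passed in explicitly so the logic is testable; the five-second window and class name are assumptions for the example.

```python
# Sketch of conversation mode: a post-interaction listening window.

class ConversationMode:
    def __init__(self, window_seconds: float = 5.0):
        self.window = window_seconds
        self.last_interaction = float("-inf")

    def on_interaction(self, now: float) -> None:
        """Call whenever Mycroft finishes handling an utterance."""
        self.last_interaction = now

    def should_listen(self, now: float, heard_wake_word: bool) -> bool:
        in_window = (now - self.last_interaction) <= self.window
        return heard_wake_word or in_window
```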

Screen Support

  • Nearby screen association architecture
  • “What is visible” context
  • HTML serving framework
  • Linux box (Openelec based?)
  • Roku adaptor
  • Chromecast adaptor
  • Other adaptors

Longer Term Goals

Extensive long-term planning has limited value – unanticipated change is inevitable. But thinking about where you can go next in general terms is important.

As an open, auditable, trusted collective we can leverage all of our voice interactions to rapidly accumulate volumes of data. Unrecognized requests can feed back into the system, associated with the recognized requests that immediately follow them. Users can volunteer to read known phrases to assist in building a corpus of accents. And it all can be stored in an anonymous manner but made available to all for research and unexpected uses.

Using the above data, machine learning frameworks can feed back into the STT systems. TTS may also be able to leverage this corpus of known human pronunciations to create more natural-sounding speech.

As we move beyond support for just English, the STT engines can be combined with the translations in the Skill vocabularies to begin automatic “translation” of commands. So a Skill that has been coded with only English in mind will still be able to handle commands spoken in German.
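At its simplest, that vocabulary "translation" could map recognized German words onto the English vocab entries a Skill registered, passing unknown words through. The sketch below uses a tiny hand-written lexicon purely for illustration; a real implementation would draw on the translated vocab files shipped with each Skill.

```python
# Sketch: map a German utterance onto English Skill vocabulary, word by word.

DE_TO_EN = {"wetter": "weather", "heute": "today", "morgen": "tomorrow",
            "zeit": "time", "welche": "what", "wie": "how", "ist": "is",
            "das": "the", "die": "the"}

def translate_command(utterance: str) -> str:
    """Map known German words to English; pass unknown words through."""
    return " ".join(DE_TO_EN.get(w, w) for w in utterance.lower().split())
```

Word-by-word mapping is obviously crude (word order and inflection differ between languages), but because intent parsing keys on keywords rather than full grammar, it is often enough for an English-only Skill to match.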
This truly just scratches the surface of possibilities.


Mycroft has been born and we take the nurturing of this technology seriously. They say it takes a village to raise a child, and this extraordinary entity is going to take much more than that. We will endeavor to build the relationships, frameworks, technologies and systems needed to allow public and private organizations, businesses, groups and individuals to participate in this effort.


Catchy (punny) name for wake word project = “Hello Word”

Ha! I’m a fan of puns. :)

Other names floating around in my head were “Aloha” or “Achtung”. I’ve also been trying to figure out if any name can be made out of this scene…


I’m following Mycroft with a lot of interest, wishing it a lot of success. We are at a point where the speed of computer-interaction development using big data could create a future in which internet interaction primarily goes through the big five (Google, Amazon, Microsoft, Facebook and Apple). Mycroft seems to be one of the few projects poised to compete effectively while ensuring privacy.

However, to do that Mycroft must be more than a voice/speech system connected to Skills. I appreciate the scope of such an endeavor, which is why I don’t want all these efforts to be invested without an eye on the others. The next wave of personal assistants is the building of an understanding of the user. This is where the big-data companies are competing to lock us in, and this is the big challenge going forward. Considering the rise of “fake news”, it has become even more important to avoid a future where editorial power is wielded by just five organisations.

I hope that the machine learning aspects of the Mycroft project are fully explored, if Mycroft is to have relevance in a year or two. It seems to me that if this is the direction for Mycroft, the next big challenge for the project is to wed machine learning about the user to data structures under the user’s control, while at the same time achieving accelerated learning from truly anonymised aggregated data.

Reading and re-reading the ambitious and comprehensive plan above, I am still not sure if Mycroft is supposed to be a voice control system or if the aim is to actually become a personal assistant comparable to the solutions from the big five. Could you clarify the roadmap on that point?

Many thanks for the huge amount of work already put into this project.


Thanks for the feedback, Kallisti! I agree that the long-term value of a personal AI assistant requires far more than just simple Skills. The roadmap I laid out above is far more focused on laying the groundwork needed for the learning system we are envisioning – things that we absolutely need and that will elevate the user experience today. That kind of system needs effective and pervasive interface points, which is what mycroft-core is intended to provide.

We recently released our new backend and are focusing in the near term on making Mycroft available on a variety of platforms. But we still have a long-term goal of building a “strong AI”. There are a lot of intermediate steps between here and there, and I think the “voice assistant” is a step that provides a lot of bang for most users. Once that is established, it provides a great platform for solid interactions.

As someone with a speech impediment (stuttering), I have always found it hard to use any voice recognition software. If you feel it would help, or you are in need of any kind of voice recordings of that nature, let me know. I am not offended by such things if it will serve to help others in the future. Thanks for the project, and I am looking forward to all of it.