The CommonPlay Skill Infrastructure

Originally published at:

We recently added a new piece of software architecture to Mycroft known as a CommonPlaySkill. This is the first of a series of “Common” infrastructure pieces which will make working with Mycroft much more natural and powerful.

What is a Skill?

First a quick review: a Skill adds new abilities to your Mycroft. Think of it like the scene in the Matrix where Neo learns Jiu-Jitsu. Plug in a skill and suddenly Mycroft has new powers. Skills have two primary pieces: intents which allow them to define patterns of words to listen for, and handlers which allow them to perform an action when the intent is heard.

For example, a simple skill can handle phrases like “tell me a joke”. The skill has an intent which spells out an interest in that phrase (along with related phrases like “I want to hear a joke”, etc). That intent is connected to a handler which looks up a random joke and has Mycroft read it to you. Hilarity ensues.

Why do we need CommonPlay?

Clearly, the skill system is really powerful! But it has an inherent limitation – it decides the handler purely on the word patterns. While I can easily define a pattern that captures the phrase “play something”, without a deeper understanding of that something Mycroft would be unable to distinguish which player to use purely from the words.

Here are some example phrases that illustrate the challenges:

play Zork
This one is easy – there is a game called Zork, just play it.

play the News
This one is easy too – fire up NPR!

Play Huey Lewis and the News
Looking at this naively (as if I’ve never heard Huey Lewis), is Huey Lewis a reporter or a singer? Which skill should handle this?

play The Latest Single by The Hot New Band
Even if I understand this is a song request, it is impossible to tell from these words which music service has the legal contracts in place to be able to play the music.

play Ragtime
Is this a band? A music style? A movie? Yes to all of these. What should Mycroft do?

CommonPlay Approach

A single skill (skill-playback-control) currently captures all of the “play *” style utterances, like those listed above. This skill will now query all the CommonPlay skills and give them an opportunity to respond with:
  1. I can potentially handle that request
  2. This is how confident I feel in my handling
After the CommonPlay skills respond, there are a few ways to continue. If only one skill replies, it is the winner and will handle the request. When there are multiple respondents, the highest confidence wins. If there are several with about the same confidence, we can ask the user to pick the winner.
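The selection rules above can be sketched in a few lines. This is only an illustration, not the playback control skill's actual code: the `Reply` type, the `pick_winner` function and the `tie_margin` value are hypothetical names and numbers chosen for the example.

```python
from typing import List, NamedTuple


class Reply(NamedTuple):
    skill_id: str      # which skill answered (hypothetical field name)
    confidence: float  # 0.0 to 1.0, higher means more certain


def pick_winner(replies: List[Reply], tie_margin: float = 0.05) -> List[Reply]:
    """Return the candidate skills for a "play ..." request.

    An empty list means nobody can handle the request, a single-item list is
    a clear winner, and a longer list is a near-tie to put to the user.
    """
    if not replies:
        return []
    ranked = sorted(replies, key=lambda r: r.confidence, reverse=True)
    best = ranked[0]
    # Everything within tie_margin of the best counts as "about the same".
    return [r for r in ranked if best.confidence - r.confidence <= tie_margin]


candidates = pick_winner([
    Reply("news", 0.80),
    Reply("spotify", 0.95),
    Reply("pandora", 0.93),
])
print([r.skill_id for r in candidates])
```

Here two skills land within the margin of each other, so the user would be asked to choose between them.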

Gory Details

As they say, the devil is in the details. How do you catch the query? How do you format the response? What does “confidence” mean? We wrapped all of this up in a class called CommonPlaySkill which itself derives from the familiar MycroftSkill. To participate in the CommonPlay system you only need to derive your skill from CommonPlaySkill and override a handful of methods. Here is all you need to connect a News skill to the CommonPlay system:
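def CPS_match_query_phrase(self, phrase):
    # Claim the query only if it matches the "News" vocabulary
    if self.voc_match(phrase, "News"):
        return ("news", CPSMatchLevel.TITLE)
    return None  # not a match, let other skills answer

def CPS_start(self, phrase, data):
    # Begin the news stream
    pass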
That’s it. The first method answers the CommonPlay query, responding to any phrase that contains the word “News”. The framework will generate a standardized confidence level based on the given CPSMatchLevel and the number of words in the phrase that were used in the “news” title match it found.

The second method is invoked by the framework if the query match is determined to be the best match.
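To make the confidence idea concrete, here is a rough sketch of how a match level and the share of matched words could be combined into one number. Everything in it is an assumption for illustration: `MatchLevel` is a cut-down stand-in for CPSMatchLevel, and the base scores and the size of the bonus are made up rather than taken from the framework.

```python
from enum import Enum


class MatchLevel(Enum):  # hypothetical stand-in for CPSMatchLevel
    EXACT = 1
    TITLE = 2
    GENERIC = 3


# Assumed base scores; the framework's real values may differ.
BASE_SCORE = {
    MatchLevel.EXACT: 1.0,
    MatchLevel.TITLE: 0.8,
    MatchLevel.GENERIC: 0.5,
}


def confidence(phrase: str, matched: str, level: MatchLevel) -> float:
    """Combine the match level with how much of the phrase was matched."""
    phrase_words = phrase.split()
    if not phrase_words:
        return 0.0
    coverage = min(len(matched.split()) / len(phrase_words), 1.0)
    # Small bonus (at most 0.1 here) for matches that cover more of the phrase.
    return min(BASE_SCORE[level] + 0.1 * coverage, 1.0)


# "news" covers one of the two words in "the news".
score = confidence("the news", "news", MatchLevel.TITLE)
```

The point of the bonus is that a title match consuming the whole phrase should outrank one that only explains a single word of a long request.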

You can see the entire News Skill on GitHub. It also has an intent which supports a few other non-“play” phrases such as “what is the news” and “tell me the latest news”. As you can see, it has all the capabilities of a regular skill in addition to being in the CommonPlay system.

I won’t bore you with lines of code here, but you can see more examples involving complex matches on the Pandora/Pianobar Skill and the Spotify Skill.

So Much in Common

This is the first of several “Common” skill frameworks I have planned. The CommonQASkill will allow Question and Answer skills to search their databases for answers and then present the best answer found. A good example of why this is needed is the question “How old is …”. From those words alone (not knowing the specific name) you can’t tell if the best answer would be in Wikipedia, IMDb, or Wookieepedia (a Star Wars knowledge base). It might even be best answered by a skill that tracks refrigerator contents – “How old is my milk?”. The CommonQASkill framework will allow each of these skills to look at the specific query and report back how confidently they can answer that question.

A CommonIoTSkill is also coming, making it easy to combine multiple types of Internet of Things systems. They can handle identical verbal requests such as “turn on the light” by looking at the context clues, such as the location of the Mycroft unit which heard the words.

Something for Everyone

Everyone is welcome to create a Common Skill. The framework will likely evolve, but by deriving from the CommonPlaySkill class, your skill will receive the benefits of this evolution. Play on!

Is there more reading / examples / information on the CommonPlay skill infrastructure?
I am trying to wrap my adolescent Python brain (I am not adolescent, only my Python skills) around how the transaction works between the CommonPlay system and my skill. What happens to the original intent builder structure? Do I just remove the word “play” from my original intent, or does the whole utterance get passed on to any skill that is registered as a CommonPlay skill? Do I need to mark my skill utterance with something that uniquely identifies my skill, such as “kodi”? Does the CommonPlay skill use my intent to determine the confidence, or do I have to determine the confidence based on the utterance passed from CommonPlay? Sorry if I am confusing the issue, but I would like to understand this a bit more before I begin carving up my existing skill(s) to support this.

hey there @pcwiii - @forslund is going to do a writeup / tutorial on this soon so that we have some more documentation available on CommonPlay.

The WIP can be found here:

Comments, questions and suggestions are welcome :)


And this doco is now live at;


I have read through the documents and still have a question. When reworking an existing skill to use the CommonPlay structure, is the original intent builder still applied to the phrase returned by CommonPlay?

No, it is not. CPS_match_query_phrase gets the phrase and then has to handle all parsing of the string itself.
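As a rough idea of what that parsing could look like, here is a small helper that checks whether the user named a particular service at the end of the phrase. It is only a sketch: `parse_play_phrase`, the marker words and the “kodi” default are hypothetical and not part of the CommonPlay framework.

```python
from typing import Tuple


def parse_play_phrase(phrase: str, service: str = "kodi") -> Tuple[str, bool]:
    """Split a phrase into (title, service_was_named).

    Hypothetical helper: it only recognises a trailing "on/using/with <service>".
    """
    words = phrase.lower().split()
    for marker in ("on", "using", "with"):
        if words[-2:] == [marker, service]:
            # Drop the trailing "<marker> <service>" and keep the title.
            return " ".join(words[:-2]), True
    return " ".join(words), False


title, named = parse_play_phrase("The Matrix on Kodi")
```

Inside CPS_match_query_phrase you could then return a higher match level when the service was named explicitly and a lower one when you are only guessing from the title.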


I tried implementing a skill that inherits from CommonPlaySkill and was surprised to find that the latency is approximately 6-7 seconds before the skill receives the intent. That delay is unacceptable, and I wonder what is causing it. By contrast, the latency for weather and time is only 1-2 seconds.

When the playback control skill is triggered by “play …”, it forwards the phrase to all skills that implement CommonPlaySkill. Then it waits a few seconds for all CP skills to answer. Then it chooses the skill with the best answer (highest score) and signals this skill to play (CPS_start), which may take another second to retrieve the audio data stream for playing.
As you can see, this round trip takes a few seconds…

Out of curiosity, how many Skills do you have loaded that use the CommonPlay framework?
Generally we wouldn’t expect people to have more than about 3 or 4 in normal usage.

I went back and reinstalled everything from scratch. The NPR news skill was installed by default, but I removed it, so the only apparent CommonPlaySkill installed is my own. The skill does essentially nothing:

 def CPS_match_query_phrase(self, phrase):
     return (phrase, CPSMatchLevel.EXACT)
 def CPS_start(self, phrase, data):
     print('The phrase was ' + phrase)

When the skill is invoked, I see the following in the logs:
16:34:24.431 - Playback Control Skill - INFO - Resolving Player for: the who
16:34:29.008 - Playback Control Skill - INFO - Playing with: mymusic

This shows a lag of approximately 5 seconds to resolve the skill. I found the smoking gun in mycroft-playback-control.mycroftai/, in the method handle_play_query_response, which extends the timeout to 5 seconds. When I changed this to 1 second, the delay shrank to 1 second. The extension is in turn triggered by the __handle_play_query method, which has:

self.bus.emit(message.response({"phrase": search_phrase,
                                "skill_id": self.skill_id,
                                "searching": True}))

I don’t see why this delay is built into the search - clearly some skills may take longer than others to resolve it, but a fixed delay that long feels wrong.

Strangely enough the next thing that happens is that stop() is called on the skill, which is unexpected. I’m not sure what causes that, but I think I’ll switch away from the common play framework. I’m not planning to release my skill anyway until the system can work without

We’ll hopefully be able to shorten it a smidgen after PR #1889 is merged.

The way things are currently, the CPS skills are checked in a serial fashion. (If a bus message has several connected listeners, they are handled serially.)

The PR changes the behaviour so the CPS skills are run in parallel, which should let us cut down the safe extension time…

If it always takes 5 seconds for a single CPS skill, however, there’s a bug. I believe the intention is to wait 5 seconds or until all pending CPS skills have answered, whichever comes first.
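To illustrate that intended behaviour, here is a minimal sketch of a wait that ends as soon as every pending skill has replied, with the timeout only as a fallback. This is not the playback control skill's code; `QueryCollector` and its methods are hypothetical names, and it uses plain threading rather than the message bus.

```python
import threading
from typing import Dict, Iterable


class QueryCollector:
    """Collects replies until every pending skill answers or the timeout hits."""

    def __init__(self, pending: Iterable[str]):
        self._pending = set(pending)
        self._replies: Dict[str, float] = {}
        self._lock = threading.Lock()
        self._all_done = threading.Event()
        if not self._pending:
            self._all_done.set()  # nothing to wait for

    def on_reply(self, skill_id: str, confidence: float) -> None:
        with self._lock:
            self._replies[skill_id] = confidence
            self._pending.discard(skill_id)
            if not self._pending:
                self._all_done.set()  # nobody left to wait for

    def wait(self, timeout: float = 5.0) -> Dict[str, float]:
        # Returns as soon as the event is set, or after `timeout` seconds.
        self._all_done.wait(timeout)
        with self._lock:
            return dict(self._replies)


collector = QueryCollector(["mymusic"])
# Simulate the only CPS skill answering after 0.1 seconds.
threading.Timer(0.1, collector.on_reply, args=("mymusic", 1.0)).start()
replies = collector.wait(timeout=5.0)
```

With a single skill that answers quickly, the wait returns almost immediately instead of sitting out the full 5 seconds.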


Excited to try this one