Amazon Expands Alexa's Speech Synthesis Markup Language

Amazon’s Alexa supports Speech Synthesis Markup Language (SSML), a standard markup for controlling synthesized speech, to help developers make their Alexa skills sound more natural, and today, Amazon is adding five tags to the set Alexa understands. The new SSML tags allow Alexa to whisper, change the emphasis placed on words, adjust prosody (the volume, pitch, and rate of speech), bleep out words as if they’re expletives on cable television, and say something other than the text as written, which could be useful for speaking an abbreviation or symbol in full. Most of the new tags take their own attributes, allowing developers to play around with a wide range of different options for Alexa’s voice.

The whisper tag is quite simple; it just makes Alexa whisper. It is written as the amazon:effect tag with its name attribute set to whispered, takes no other options, and requires an opening and closing tag around the text. The expletive tag works in much the same way, using say-as with interpret-as set to expletive, but instead of speaking the enclosed words more softly, Alexa simply emits a bleep sound effect in their place. The sub tag lets the text in a Skill’s response read one thing while Alexa says another: its alias attribute holds what Alexa actually speaks in place of the enclosed text. This can be useful for abbreviations and symbols, or for Skills that will be used on devices with screens, where a user may be reading content while Alexa speaks it. The emphasis tag takes a level of none, moderate, or strong, which are self-explanatory. There is also a reduced level, which lowers the emphasis on a word or phrase by speaking it more quietly and faster. The final new tag in the set, prosody, allows developers to control the volume, pitch, and rate of Alexa’s speech through attributes of the same names. Combining these attributes lets a developer tune a passage to sound exactly the way they want.
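To make the five tags concrete, here is a minimal Python sketch that builds an SSML string using each of them. The element and attribute names (amazon:effect with name="whispered", say-as with interpret-as="expletive", sub with alias, emphasis with level, and prosody with rate, pitch, and volume) follow Amazon's SSML reference as best understood here. The helper functions themselves (whisper, bleep, sub, emphasis, prosody, speak) are hypothetical names invented for illustration and are not part of any Alexa SDK. The list of emphasis levels is taken from the article and should be treated as an assumption.

```python
from xml.sax.saxutils import escape, quoteattr

# Levels named in the article; the exact set Alexa accepts is an assumption.
EMPHASIS_LEVELS = ("strong", "moderate", "reduced", "none")
PROSODY_ATTRS = ("rate", "pitch", "volume")


def whisper(text: str) -> str:
    """Wrap text so it is whispered."""
    return f'<amazon:effect name="whispered">{escape(text)}</amazon:effect>'


def bleep(text: str) -> str:
    """Wrap text so it is bleeped out instead of spoken."""
    return f'<say-as interpret-as="expletive">{escape(text)}</say-as>'


def sub(written: str, spoken: str) -> str:
    """Keep `written` in the markup but have `spoken` read aloud."""
    return f"<sub alias={quoteattr(spoken)}>{escape(written)}</sub>"


def emphasis(text: str, level: str = "moderate") -> str:
    """Stress or de-stress text at the given level."""
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unknown emphasis level: {level!r}")
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def prosody(text: str, **attrs: str) -> str:
    """Set any of rate, pitch, and volume for the enclosed text."""
    unknown = set(attrs) - set(PROSODY_ATTRS)
    if unknown:
        raise ValueError(f"unknown prosody attributes: {sorted(unknown)}")
    # Emit attributes in a fixed order so the output is predictable.
    rendered = "".join(
        f" {name}={quoteattr(attrs[name])}"
        for name in PROSODY_ATTRS
        if name in attrs
    )
    return f"<prosody{rendered}>{escape(text)}</prosody>"


def speak(*fragments: str) -> str:
    """Join fragments under the root <speak> element.

    Fragments are inserted as-is, so plain text passed here should
    already be XML-escaped by the caller.
    """
    return "<speak>" + " ".join(fragments) + "</speak>"


if __name__ == "__main__":
    print(
        speak(
            "I want to tell you a secret.",
            whisper("I am not a real human."),
            "My favorite element is",
            sub("Al", "aluminum") + ".",
            "That is",
            emphasis("really", level="strong"),
            "all.",
            prosody("Goodbye for now.", rate="slow", volume="loud"),
        )
    )
```

The resulting string would go wherever a skill response expects SSML output speech. Escaping the text with xml.sax.saxutils keeps characters such as an ampersand from breaking the markup.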

These new tags work alongside the existing catalog of SSML tags that Alexa skill developers already have at their disposal. Together, they give developers far more freedom in customizing Alexa’s voice, and the opportunity to make it sound as human as possible. Interestingly, they’re being rolled out just after Google announced a new speech synthesis system, dubbed Tacotron, which integrates and automates most of these new features and some others through machine learning and neural networks.