Researchers at Amazon are converging on a new voice recognition training method and algorithmic model that should expand the capabilities of AI to include accurate recognition of emotional cues in a user’s voice.
The new method involves the use of public data sets labeled according to “valence, activation, and dominance.” Valence is a commonly recognized measure of the positive or negative connotation of speech; activation refers to whether the speaker is actively engaged or interacting passively; and dominance can be loosely defined as the intensity of the emotion, or how “in control” the speaker feels.
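For a concrete picture of what that labeling scheme amounts to, a single annotated utterance might look something like the sketch below. The field names, the 1-to-5 scale, and the file path are illustrative assumptions rather than the annotation format of the actual corpora.

```python
# Hypothetical example of one labeled utterance under the scheme described above.
# The 1-to-5 scale and the field names are illustrative assumptions, not the
# annotation format of the public data sets Amazon used.
labeled_utterance = {
    "audio_path": "speaker_03/utterance_0412.wav",  # hypothetical file
    "valence": 2.0,      # negative (1) to positive (5) connotation
    "activation": 4.5,   # passive (1) to actively engaged (5)
    "dominance": 3.0,    # how "in control" or intense the speaker sounds
}
```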
The results of the proposed training method, now presented in an IEEE publication, are intriguing, if not groundbreaking. While other AI training methods for emotion recognition are already established, the new method uses an adversarial autoencoder and has proven to be significantly more accurate.
Tests using sentence-level feature vectors as inputs showed emotion recognition accuracy improvements, in terms of valence, of around 3 percent compared to conventional methods. Shortening the inputs to sequences of vectors representing acoustic characteristics pulled from 20-millisecond audio snippets showed an improvement of around 4 percent.
No human assistance required
As described by Amazon, the utilization of an adversarial autoencoder — an encoder-decoder neural network that learns to produce compact representations of input — is an important part of the new methodology.
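For readers who want a concrete sense of what such a component looks like, here is a minimal encoder-decoder sketch in PyTorch. The layer sizes, dimensions, and names are illustrative assumptions, not Amazon’s published architecture.

```python
import torch
import torch.nn as nn

# Minimal autoencoder sketch: the encoder compresses an input feature vector
# into a small latent representation, and the decoder tries to reconstruct
# the original input from that representation. All dimensions are assumptions.
class Encoder(nn.Module):
    def __init__(self, input_dim=512, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),  # compact representation of the input
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self, latent_dim=32, output_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim),  # reconstruction of the original input
        )

    def forward(self, z):
        return self.net(z)
```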
The first phase of the process feeds the network unlabeled data, while the second phase applies adversarial training. The goal is to create a compact representation of the input that is explicitly dedicated to emotional state recognition and is refined by passing the representations through adversarial discriminators.
The adversarial discriminators attempt to distinguish representations produced by the encoder from samples drawn from a predefined probability distribution, ensuring the final model isn’t too narrowly trained and doesn’t lean too heavily on statistical quirks of the training data.
Finally, the model is checked for accuracy by having it predict the emotion labels of the training data, and the entire process is repeated until an appropriately accurate AI model is found.
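Put together, those steps might look roughly like the training sketch below, which reuses the Encoder and Decoder classes from the earlier sketch and adds a discriminator and a small emotion classifier. The standard normal prior, the equal weighting of the losses, and the single combined update are simplifying assumptions rather than the exact procedure from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The discriminator tries to tell encoder outputs apart from samples drawn
# from a chosen prior distribution (a standard normal here, as an assumption).
class Discriminator(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, z):
        return self.net(z)

# Small head that predicts valence/activation/dominance from the latent code.
class EmotionClassifier(nn.Module):
    def __init__(self, latent_dim=32, num_attributes=3):
        super().__init__()
        self.net = nn.Linear(latent_dim, num_attributes)

    def forward(self, z):
        return self.net(z)

def training_step(encoder, decoder, discriminator, classifier,
                  features, labels, opt_ae, opt_disc):
    latent = encoder(features)

    # Step 1: unsupervised reconstruction -- no labels needed.
    recon_loss = F.mse_loss(decoder(latent), features)

    # Step 2: adversarial training. The discriminator learns to separate
    # encoder outputs from samples drawn from the prior distribution...
    prior = torch.randn_like(latent)
    disc_loss = (
        F.binary_cross_entropy_with_logits(
            discriminator(prior), torch.ones(len(prior), 1))
        + F.binary_cross_entropy_with_logits(
            discriminator(latent.detach()), torch.zeros(len(latent), 1))
    )
    opt_disc.zero_grad()
    disc_loss.backward()
    opt_disc.step()

    # ...while the encoder is pushed to produce codes the discriminator
    # mistakes for prior samples, keeping the representation from overfitting.
    fool_loss = F.binary_cross_entropy_with_logits(
        discriminator(latent), torch.ones(len(latent), 1))

    # Step 3: check that the compact representation predicts the
    # valence/activation/dominance labels of the training data.
    label_loss = F.mse_loss(classifier(latent), labels)

    total = recon_loss + fool_loss + label_loss
    opt_ae.zero_grad()
    total.backward()
    opt_ae.step()
    return total.item()
```

In a full setup, opt_ae would optimize the encoder, decoder, and classifier parameters while opt_disc handles the discriminator, and the loop would repeat over the data until the label predictions reach an acceptable level of accuracy, as described above.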
The biggest takeaway from the new method is that it doesn’t seem to require “hand-engineered features,” and it certainly doesn’t require any outside input beyond the data sets. The test case in question used a data set comprising 10,000 voice recordings from 10 different speakers. In other words, the approach relies entirely on the neural network rather than on data sets hand-labeled from a known emotional state.
That should come as a relief to privacy advocates, following several controversies centered on Amazon’s use of real human helpers to gauge and improve the performance of its Alexa AI assistant. It also means the accuracy isn’t dependent on methods riddled with human error, such as guessing at or being told the emotional state of the speakers in the training data sets.
Practical implications
Looking beyond the science and methodology behind the improvements, better recognition of emotional states has several considerable implications for AI. Although the method will almost certainly be applied to Amazon Alexa at some point in the near future, those implications extend far beyond the company’s own uses to other artificial intelligence implementations.
Amazon’s own examples range from health monitoring to simply making conversational AI better. For AI assistants such as Amazon Alexa, emotion recognition could also add context to complaints or feedback.
Taking the first of the company’s examples, chatbots already exist that are explicitly designed to help people coping with complex emotional issues. The new training method could be used to ensure those implementations recognize and respond better to real emotional states. Likewise, the method could train AI for use in a hospital or clinical environment, helping practitioners better gauge how emotionally engaged patients are so that services can be adjusted to provide a more comprehensive level of care.