PART 1: ALEXA AND ARTIFICIAL INTELLIGENCE
“ALEXA, THE FUTURE OF AMAZON” is a two-part series exploring the potential of voice-enabled devices. This first article looks closely at how the AI personal assistant functions and how Amazon leverages the technology. The second article will address Alexa’s challenges and make predictions about the future of the virtual assistant.
Alexa is a cloud-based AI personal assistant, launched with the “Amazon Echo” on November 6th, 2014. At first, its purpose was to activate the speaker by voice command, and much like other voice assistants at the time, it could answer basic questions from users. With soaring demand from Amazon customers over the years, the company has poured substantial resources into the project. Supported by gains in the accuracy and performance of Natural Language Processing (NLP) and Machine Learning (ML), Amazon has been able to combine the cloud computing power of AWS with an enormous amount of data collected over time. Six years after the launch of the “Amazon Echo”, Jeff Bezos’ vision of a world controlled by voice is becoming increasingly plausible.
On top of being the voice of millions of Amazon hardware devices such as the Fire TV, the Echo Dot, or the new Echo Buds, Alexa is a technology that can be incorporated into third-party products. Following the same framework as Amazon Web Services (AWS), the company came up with Alexa Voice Service (AVS). With AVS, any device with a WIFI chip and a microphone can become Alexa-enabled. This opens the door for manufacturers around the world to build around Alexa, and for existing product lines to upgrade their value proposition by importing a powerful voice assistant. The second side of Alexa, the Alexa Skills Kit (ASK), extends its capabilities through thousands of “skills”. Alexa’s ecosystem is powered by a strong developer community through the ASK: anyone can build a custom feature on top of Alexa, which makes the selection of skills grow every day and Alexa exponentially smarter.
Amazon’s first Alexa-enabled device was the famous “Echo”. Through this smart speaker, the company reaffirmed its positioning as a hardware manufacturer, despite the Fire Phone disaster. Both devices, developed around the same time by Amazon’s Lab126, show that the company is not afraid to innovate and conquer new markets.
Since then, Amazon has abandoned smartphones and rolled out dozens of new Alexa-enabled products. The most popular category among consumers is still the smart speaker, but new ones have emerged. The Fire TV and the Echo home products are making their place in our homes. Present in the car with the dashboard-mounted Echo Auto, Amazon is even coming up with Alexa-enabled wearables like the Echo Loop (a ring) and Echo Buds (wireless headphones). Amazon does not disclose product sales, but the company says that over 100 million Alexa devices have been sold.
AVS is one of the two main factors behind Alexa’s successful growth. Like AWS, it follows the framework of leveraging internal resources by making them available to outside businesses. The service relies on third parties’ desire to integrate Alexa into their product solutions. With the help of AVS, businesses “can build natural voice experiences that offer customers a more intuitive way to interact with the technology they use every day”. It is a simple and affordable way for a company to turn its products into smart devices. Because it is a cloud-based voice service, clients are guaranteed cutting-edge technology, as Alexa will continue to learn and be updated.
The platform provides many tools, APIs, and documentation to facilitate the integration of Alexa. There are two options around AVS:
– The first option is to connect to an existing device. Companies like Sonos, LG, or Dell have already jumped on that opportunity. The list of compatible devices is long: kitchen appliances, smart clocks, gadgets, speakers, cameras, and thermostats are just a few examples.
– The second option is the Alexa built-in solution. Amazon collaborates with leading Original Equipment Manufacturers (OEMs) to incorporate its voice assistant inside other companies’ products. This way, Alexa can follow you in your smart vehicle or be added to your personal computer and smart TV. Once again, the list of compatible options is long, making Alexa built-in an asset for AVS.
AVS is well-positioned to forever transform industries. Through a simple implementation of Alexa’s state-of-the-art voice service, all companies can now respond to consumers’ need for innovation.
ASK is the second main factor of Alexa’s success. Virtual assistants are numerous, from Siri to the Google Assistant, and the question of which is the “smartest” comes down to their capabilities, which are what fulfill the user’s needs. Skills can be defined as a virtual assistant’s capabilities; they determine how it engages with consumers and what content people have access to. Most of Alexa’s Skills are contributed by a very large community of developers, giving Alexa a competitive advantage. In September 2019, the Alexa Skills Store had over 100,000 Skills, such as Smart Home Skills (e.g., “Change the temperature on a thermostat”), Video Skills (e.g., “Pause, rewind, or fast forward video content”), and Custom Skills (pretty much anything you can think of). The ASK was made intuitive and accessible to developers, with unlimited opportunities to grow a business or start a new one with Alexa Games, Apps, Briefings, and Smart Home Control.
By allowing anyone and everyone to easily build Skills, Alexa is getting smarter every day.
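As a rough illustration of how approachable Skill-building is, the backend of a custom Skill is essentially a web service (often an AWS Lambda function) that receives a JSON request from the Alexa service and returns a JSON response containing the speech to render. The sketch below uses only the Python standard library; the Skill name “Space Facts” and the intent “GetFactIntent” are hypothetical, and the JSON shapes are simplified.

```python
def build_response(speech_text, end_session=True):
    """Wrap plain text in the JSON envelope the Alexa service expects."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "shouldEndSession": end_session,
        },
    }

def handler(event, context=None):
    """Skill endpoint, e.g. deployed as an AWS Lambda function."""
    request = event["request"]
    if request["type"] == "LaunchRequest":          # "Alexa, open Space Facts"
        return build_response("Welcome to Space Facts!", end_session=False)
    if request["type"] == "IntentRequest" and request["intent"]["name"] == "GetFactIntent":
        return build_response("A day on Venus is longer than its year.")
    return build_response("Sorry, I did not understand that.")

# Simulate the request Alexa sends when the user triggers the intent:
event = {"request": {"type": "IntentRequest", "intent": {"name": "GetFactIntent"}}}
print(handler(event)["response"]["outputSpeech"]["text"])
# A day on Venus is longer than its year.
```

In practice, developers usually rely on Amazon’s SDKs rather than building the JSON by hand, but the division of labor is the same: Alexa handles the voice, the Skill handles the logic.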
Alexa, much like other voice assistants, is powered by NLP and relies heavily on trained ML and DL models. Alexa’s complex architecture is what makes it so efficient. The visual below synthesizes the command to activate the device and ask it to perform a specific task.
Source: Alexa Developers Resources
The first reality to be aware of before analyzing how Alexa executes a request is simple: Alexa is always listening. This does not mean, however, that it is always recording or sending information to Amazon’s servers. The device actively listens for its “wake word” through a feature called Voice Activity Detection (VAD). Wake word detection is a crucial step: misinterpreting noise can lead to a “false positive” (Alexa activates when it was not intended) or a “false negative” (Alexa does not activate), causing discontent for the user. Today’s wake word options on the devices are “Alexa”, “Amazon”, “Echo”, “Computer”, and now “Hey Samuel” (from Amazon’s latest partnership with Samuel L. Jackson).
For Amazon, “it is challenging to build a wake word system with low error rates when there are limited computation resources on the device and it’s in the presence of background noise such as speech or music”. To reach the highest level of accuracy on wake word detection, Amazon combines multiple microphones, a short memory buffer, and Machine Learning.
To pick up wake words coming from multiple directions, voice assistant speakers are equipped with multiple microphones. The advantages are numerous: from better capture of commands to locating where a voice comes from, the more microphones, the better. This setup allows smart devices to follow your voice and cancel noise coming from other directions.
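A minimal sketch of the simplest multi-microphone technique, delay-and-sum beamforming: if the device knows how much later the voice arrives at each microphone, it can time-align the channels and average them, so the voice adds up coherently while uncorrelated noise partially cancels. All signals and delays below are simulated for illustration.

```python
import random

def delay_and_sum(mic_signals, delays):
    """Compensate each microphone's known arrival delay, then average:
    the target voice adds coherently while independent noise partly cancels."""
    n = min(len(sig) - d for sig, d in zip(mic_signals, delays))
    aligned = [sig[d:d + n] for sig, d in zip(mic_signals, delays)]
    return [sum(samples) / len(aligned) for samples in zip(*aligned)]

rng = random.Random(0)
voice = [rng.uniform(-1, 1) for _ in range(400)]   # the user's speech (simulated)
delays = [0, 3, 7]                                 # arrival delay at each microphone

# Each microphone hears the voice (later for farther mics) plus its own noise.
mics = []
for d in delays:
    delayed = [0.0] * d + voice
    mics.append([s + rng.gauss(0, 0.5) for s in delayed])

beamformed = delay_and_sum(mics, delays)

def rms_error(estimate, reference):
    return (sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(estimate)) ** 0.5

print(rms_error(mics[0], voice) > rms_error(beamformed, voice))  # True: cleaner output
```

Averaging three independently noisy channels reduces the noise power roughly threefold, which is why adding microphones helps.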
Another feature that contributes to better wake word detection is the short memory buffer on Echo devices. Rohit Prasad, Amazon’s Head Scientist of Alexa AI, explained in an interview with Quartz that “the devices are intentionally limited technically, so they don’t have the capability to listen to your conversation” and that Alexa only stores “a few seconds, just long enough for the wake words” before deleting the recording. This short attention span lets Alexa focus only on the wake word.
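This “short attention span” behaves like a fixed-size ring buffer: new audio continually overwrites the oldest samples, so only the last few moments ever exist on the device. A minimal sketch, with buffer sizes that are illustrative assumptions rather than Amazon’s actual parameters:

```python
from collections import deque

SAMPLE_RATE = 16000            # audio samples per second (typical for speech)
BUFFER_SECONDS = 2             # illustrative retention window, not Amazon's figure

# A fixed-size ring buffer: once full, every new sample silently evicts the
# oldest one, so a long recording can never accumulate on the device.
audio_buffer = deque(maxlen=SAMPLE_RATE * BUFFER_SECONDS)

def on_new_audio(samples):
    """Called by the microphone driver with each fresh chunk of audio."""
    audio_buffer.extend(samples)

on_new_audio([0.0] * (SAMPLE_RATE * 5))   # five seconds of audio arrive...
print(len(audio_buffer) / SAMPLE_RATE)    # 2.0 -> only the last two seconds remain
```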
Lastly, to prevent false negatives, Amazon uses machine learning to train its wake word detection model. The model is fed with many instances of the wake word covering different pronunciations, accents, acoustic conditions, and other variables that help predict the output. Through supervised learning, the model learns the mapping between these parameters and the labeled outputs. By combining millions of parameters and past outputs, the ML model can predict with very high accuracy when the user intends to activate Alexa.
More specifically, Amazon scientists have stated that Alexa’s wake word performance relies on Deep Neural Networks (DNNs), i.e., deep learning, a subset of machine learning that is essentially an Artificial Neural Network with multiple layers. The objective is to match voice patterns through “Neural Network Training”. Each word recorded during wake word detection goes through multiple layers of algorithms, whose purpose is to rule out all possible “false positives”. Amazon can use such a technique thanks to its powerful computing infrastructure, its large cloud storage, and the enormous amount of data it has collected over the years. Each time a recorded word successfully matches the output of a layer, it moves on to the next one. Once the recorded word is confirmed to be the wake word, Alexa starts to share the audio with Amazon’s cloud server.
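One way to picture this layered filtering is as a cascade of detectors, each stricter than the last: a sound is accepted as the wake word only if it clears every stage. The scores and thresholds below are invented for illustration; in the real system each stage is a trained neural-network model, not a hand-written rule.

```python
THRESHOLDS = [0.5, 0.6, 0.9]   # each successive layer must be more confident

def stage_scores(sound):
    """Hypothetical per-layer confidence scores for a candidate sound.
    In the real system each stage is a trained neural-network layer."""
    return [sound["energy"], sound["pattern_match"], sound["full_model"]]

def is_wake_word(sound):
    """A candidate advances only while it clears every layer's threshold."""
    return all(score >= t for score, t in zip(stage_scores(sound), THRESHOLDS))

# Loud TV chatter passes the cheap early checks but fails the strict final one;
# an actual "Alexa" clears all three. (Scores are invented for illustration.)
tv_chatter = {"energy": 0.8, "pattern_match": 0.7, "full_model": 0.3}
real_alexa = {"energy": 0.9, "pattern_match": 0.95, "full_model": 0.97}
print(is_wake_word(tv_chatter), is_wake_word(real_alexa))   # False True
```

The cascade design also explains why the cheap checks run on the device while the most expensive verification can run in the cloud.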
Once the wake word is detected and confirmed, Alexa connects to Amazon’s cloud server to analyze the user’s command. For a request to be handled best, a simple formula follows the wake word. First comes the “launch word”, the start of the command and the first step in launching a Skill; it can be a common verb like “start”, “ask”, or “play”. Then comes the “invocation name”, the name given to the specific Skill, which can be anything the developer wants, from a single generic word to several words. It is even possible to reuse a word that already serves as another invocation name, as Alexa will ask the user which Skill they want to launch. With only these two elements, Alexa can launch the Skill. However, the command can be even more specific: users can add an “utterance” to give a command inside the Skill, and even a “slot”. Adding a slot to the voice command is like adding a variable to a formula. These two extra elements give Alexa all the tools to fully fulfill the user’s need.
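Put together, a request follows the pattern wake word + launch word + invocation name + utterance + slot. The toy parser below illustrates that structure on a hypothetical “Daily Horoscope” Skill; in reality, slot extraction is performed statistically by Alexa’s NLU, not by string matching.

```python
LAUNCH_WORDS = {"ask", "open", "start", "play", "tell", "launch"}
INVOCATION_NAME = "daily horoscope"                       # hypothetical Skill
UTTERANCE_PREFIX = "give me the horoscope for"            # the slot follows this

def parse_command(text):
    """Split a request into launch word, invocation name, utterance, and slot.
    Real requests are interpreted statistically by NLU, not by string rules."""
    words = text.lower().rstrip(".?!").split()
    if words[0].rstrip(",") != "alexa" or words[1] not in LAUNCH_WORDS:
        return None
    rest = " ".join(words[2:])
    if not rest.startswith(INVOCATION_NAME):
        return None
    utterance = rest[len(INVOCATION_NAME):].strip()
    slot = (utterance[len(UTTERANCE_PREFIX):].strip()
            if utterance.startswith(UTTERANCE_PREFIX) else None)
    return {"launch": words[1], "invocation": INVOCATION_NAME,
            "utterance": utterance, "slot": slot}

parsed = parse_command("Alexa, ask Daily Horoscope give me the horoscope for Taurus")
print(parsed["slot"])   # taurus
```

Here “ask” is the launch word, “daily horoscope” the invocation name, the rest the utterance, and “taurus” fills the slot.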
Once the user speaks the command, several powerful systems work together for Alexa to respond successfully. The most important ones in processing the user’s request are signal processing, Natural Language Processing (NLP), and Machine Learning, most of which run on the Alexa Service Platform (ASP) in the Amazon cloud.
Before the audio can be processed, the recording needs to be “cleaned” for the Automatic Speech Recognition (ASR) engine to be accurate. When speaking to Alexa, the user is often in a noisy environment, or the smart speaker itself is playing music; for that reason, Alexa needs signal processing to clean the audio. To do so, it relies on systems such as Acoustic Echo Cancellation (AEC) and beamforming. First, Alexa has to tackle “acoustic echo”, the external signal recorded along with the user’s voice. This signal mostly comes from the loudspeaker and needs to be subtracted from the microphone signal. To deal with the issue, Amazon uses AEC, an algorithm whose goal is “to remove the acoustic echo component from the microphone signal, so that the customer’s voice can be clearly understood by the ASR engine”. The technique is applied first on the device, then more thoroughly once the audio is sent to the cloud. In addition, for an optimal signal, the device relies on “beamforming” to emphasize the user’s voice: this signal processing technique focuses the recording in one desired direction and discards noise coming from the others.
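The core of AEC can be sketched with a classic adaptive filter: since the device knows exactly what it is playing through its loudspeaker, it can learn how that signal bounces back through the room and subtract the predicted echo from the microphone. Below is a minimal least-mean-squares (LMS) version in plain Python; the echo path is simulated, the user is silent in this simulation (so the residual should shrink toward zero), and real systems use far more sophisticated filters.

```python
import random

def lms_echo_cancel(reference, mic, taps=4, mu=0.05):
    """Least-mean-squares adaptive filter: estimate the echo path from the
    loudspeaker reference and subtract the predicted echo from the mic."""
    w = [0.0] * taps                      # estimated echo-path coefficients
    cleaned = []
    for n in range(len(mic)):
        x = [reference[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        echo_estimate = sum(wi * xi for wi, xi in zip(w, x))
        e = mic[n] - echo_estimate        # residual: ideally just the voice
        w = [wi + mu * e * xi for wi, xi in zip(w, x)]   # adapt toward the echo
        cleaned.append(e)
    return cleaned

rng = random.Random(1)
music = [rng.uniform(-1, 1) for _ in range(2000)]             # what the speaker plays
echo = [0.6 * music[n] + (0.3 * music[n - 1] if n else 0.0)   # simulated room echo
        for n in range(len(music))]
mic = list(echo)                                              # user is silent here

residual = lms_echo_cancel(music, mic)
rms = lambda s: (sum(v * v for v in s) / len(s)) ** 0.5
print(rms(residual[1000:]) < 0.05 * rms(mic[1000:]))          # True: echo mostly removed
```

After a few hundred samples the filter has learned the two-tap echo path and the residual is essentially zero, which is exactly what lets the ASR engine hear the user over the music.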
Now that the audio is clean, Alexa combines Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU) in the Amazon cloud to understand the user’s request. As one of the first steps in the process, ASR is responsible for converting spoken words into text. Its algorithms compare audio waveforms against an existing database to match the sounds with a language. Once the language is recognized, the technology can identify which word was pronounced and turn it into text. However, text is not synonymous with meaning for Alexa. That is where NLU, a subdomain of Natural Language Processing (NLP), comes in. Natural language understanding is “essentially a technology that translates human language into computer language”. More specifically, NLU converts the text from ASR into an “intent” for Alexa to act on. Amazon develops its algorithms to understand not only the words we say but also the context behind them. Before this technology, voice assistants had to be largely hardcoded with thousands of inputs for the algorithm to understand that “What time is it?”, “Give me the time?” and “Time please” all mean the same thing. Today, NLU learns from past conversations and manages to extract meaning from human language. It is important to note that throughout this speech-to-text and text-to-meaning process, both ASR and NLU are powered by machine learning algorithms.
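The step from text to intent can be caricatured as similarity matching against example utterances. The toy classifier below maps the three phrasings of the time request above to a single hypothetical intent using word overlap; production NLU instead uses statistical models trained on vast amounts of data.

```python
# Sample utterances per intent (intent names are illustrative; a production
# NLU model learns from far more data than a handful of phrases).
INTENT_EXAMPLES = {
    "GetTimeIntent": ["what time is it", "give me the time", "time please"],
    "GetWeatherIntent": ["what is the weather", "will it rain today"],
}

def classify_intent(text):
    """Pick the intent whose sample utterances best overlap the input words
    (Jaccard similarity): a crude stand-in for statistical NLU."""
    words = set(text.lower().rstrip("?.!").split())
    def best_score(examples):
        return max(len(words & set(e.split())) / len(words | set(e.split()))
                   for e in examples)
    return max(INTENT_EXAMPLES, key=lambda name: best_score(INTENT_EXAMPLES[name]))

for phrase in ["What time is it?", "Give me the time?", "Time please"]:
    print(phrase, "->", classify_intent(phrase))   # all map to GetTimeIntent
```

Once the intent is identified, Alexa no longer cares which exact words were used: all three phrasings trigger the same downstream action.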
Finally, now that Alexa understands the user’s request, it is time to act on it. The final step is Natural Language Generation (NLG), powered by Deep Learning (DL) models.
NLG is in charge of transforming structured data into natural human language. To do so, Alexa’s decision-making process relies on a DL architecture called the Recurrent Neural Network (RNN). Unlike an Artificial Neural Network (ANN), also called a “feed-forward neural network”, in which inputs are processed only in the forward direction, an RNN can feed its generated output back in as a future input, in a loop (see image below).
Source: Analytics Vidhya
RNNs are designed to recognize patterns in sequences of data such as spoken words or text. The network is very useful for Alexa because it remembers information through time, producing better predictions of what the user wants. These algorithms can also take time into account, since “they have a temporal dimension”, making them more efficient. However, plain RNNs have a short-term memory. To mitigate the issue, Alexa relies on Long Short-Term Memory (LSTM) networks, a more complex architecture capable, through mechanisms called “gates”, of handling long-term dependencies and overcoming the short-memory problem. Once the final output is defined, Alexa executes the required Skill. Thanks to NLG, the response to the user is put into words; the final touch is text-to-speech (TTS) technology so that Alexa can vocally answer the user’s request.
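The recurrence itself is simple to sketch: at each time step, the new hidden state is computed from the current input and the previous hidden state, so earlier inputs keep influencing later ones. The scalar toy below (weights chosen arbitrarily) also shows why plain RNNs forget: the trace of an early input decays step by step, which is the short-term-memory problem LSTMs address.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.9):
    """One recurrent step: the new hidden state mixes the current input with
    the previous hidden state, so earlier inputs influence later outputs."""
    return math.tanh(w_x * x + w_h * h_prev)

# Feed a single "spike" followed by silence into a scalar toy network
# (real RNNs use vectors and weight matrices; the weights here are arbitrary).
h, history = 0.0, []
for x in [1.0, 0.0, 0.0, 0.0]:
    h = rnn_step(x, h)
    history.append(round(h, 3))
print(history)   # the spike's trace persists but fades at every step
```

An LSTM replaces this single tanh update with gated updates that can choose to keep, write, or erase information, which is what lets it carry context across much longer sequences.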
SECOND PART OF THE ARTICLE COMING SOON
Now that Alexa has no more secrets for you, come back next week to discover Alexa’s Challenges and the Future of the AI personal assistant in “ALEXA, THE FUTURE OF AMAZON (2/2)”.
 “Amazon wants you to be surrounded with Alexa—wherever you are”, MIT Technology Review, September 26th, 2019.
 “The Alexa Skills Store now has more than 100,000 voice apps”, VentureBeat, September 25th, 2019.
 “Zero to Hero, Part 1: Alexa Skills Kit Overview”, Alexa Developers, February 6th, 2020.
 “Alexa Scientists Present Two New Techniques That Improve Wake Word Performance”, Amazon Science Blog, Minhua Wu, April 12th, 2018.
 “Alexa doesn’t have the attention span to secretly eavesdrop on your conversations”, Quartz, Dave Gershgorn, November 8th, 2017.
 “AI & Business – Module 1: General Introduction”, Jonathan Moioli, Fall 2020.
 “CNN vs. RNN vs. ANN – Analyzing 3 Types of Neural Networks in Deep Learning”, Analytics Vidhya, Aravind Pai, February 17th, 2020.
 “Recurrent Neural Network (RNN) and LSTM”, Medium, Marvin Wang, June 16th, 2019.