Voice control is fast becoming a popular interface, with hands-free capabilities making daily tasks easier and quicker. But how exactly does this innovative technology let your home magically respond to your every command? Here are 16 voice control keywords that will help explain how it all works —
1. Far-Field Microphones — Personal computing devices have had microphones for a long time, but they don’t work well from far away. Far-field microphones, on the other hand, are an array of mics that exploit the spacing between them to amplify sound arriving from some directions and reduce it from others. This makes it possible to speak from across the room in a “hands-free” environment. These microphones use algorithms to suppress surrounding noise and deliver a clear, easily understandable signal. The magical far-field voice experience is enhanced by other technologies, defined below, which include barge-in, beam-forming, noise reduction, acoustic echo cancellation, and automatic speech recognition. Because the array’s calculations depend on the distance between microphones, it’s hard to shrink these devices below a minimum size.
2. Barge-In — Imagine playing music or watching TV with a nearby far-field microphone. Trying to yell over the noise can be quite difficult. This is where “barge-in” technology comes in. With “barge-in,” the listening device is aware of the audio it is playing and able to digitally remove it from what the microphone hears, reducing noise and increasing accuracy. Amazon Echo is a great example of this technology. Saying “Alexa” while it’s playing music will interrupt the music and cue Alexa to listen for your next command. Unfortunately, this is much harder if the music source is external to the device, but expect that to improve over time.
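To make the idea concrete, here is a minimal sketch of the adaptive-filtering trick behind barge-in: because the device knows the audio it is playing, it can learn how that audio reaches the microphone and subtract it, leaving mostly the user’s voice. This toy uses a basic LMS filter with made-up signals; production echo cancellers are far more sophisticated, and Amazon has not published the specifics of Alexa’s implementation.

```python
import numpy as np

# Toy illustration of the idea behind "barge-in": the device knows the audio it
# is playing (the reference), so an adaptive filter can subtract its echo from
# the mic signal, leaving mostly the user's voice. A minimal LMS sketch only.

rng = np.random.default_rng(0)
n = 5000
reference = rng.standard_normal(n)          # music the device is playing
voice = np.sin(0.05 * np.arange(n))         # stand-in for the user's speech
echo = 0.6 * np.roll(reference, 3)          # the music as the mic hears it
mic = echo + voice                          # what the far-field mic records

taps = 8
w = np.zeros(taps)                          # adaptive filter weights
mu = 0.01                                   # learning rate
cleaned = np.zeros(n)

for i in range(taps, n):
    x = reference[i - taps:i][::-1]         # recent reference samples
    echo_estimate = w @ x
    error = mic[i] - echo_estimate          # echo removed -> mostly voice
    w += mu * error * x                     # LMS weight update
    cleaned[i] = error

print("residual echo power:", np.mean((cleaned[1000:] - voice[1000:]) ** 2))
```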
3. Beam-Forming — Imagine you have a far-field microphone in a room with a TV on one side and you on the other. Even if the TV is relatively loud, beam-forming technology enables the microphones to amplify your speech and reduce the noise from the TV, effectively making it easy to be heard in a loud environment. This is particularly useful in automotive applications where the driver is always in a fixed location and noise in front of the car can be reduced. Unfortunately, if you take the earlier example and move to stand next to the TV, beam-forming won’t help discern your voice from the TV, which is why beam-forming by itself is not a perfect solution.
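Here’s a rough illustration of delay-and-sum beam-forming, the simplest version of the idea: if you know how much later the talker’s sound reaches each microphone, shifting each channel to line them up and averaging reinforces the talker while independent noise partially cancels. The delays and signals below are invented for the example; real devices estimate the delays continuously.

```python
import numpy as np

# Minimal delay-and-sum beamforming sketch: undo each microphone's arrival
# delay so the talker's signal adds up coherently while noise averages out.

def delay_and_sum(channels, delays_in_samples):
    """channels: one 1-D numpy array per microphone.
    delays_in_samples: integer delay to undo for each channel."""
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays_in_samples)]
    return np.mean(aligned, axis=0)

# Toy example: the same "voice" reaches three mics with different delays,
# plus independent noise on each mic.
rng = np.random.default_rng(1)
voice = np.sin(2 * np.pi * 0.01 * np.arange(2000))
delays = [0, 4, 9]
mics = [np.roll(voice, d) + 0.5 * rng.standard_normal(2000) for d in delays]

output = delay_and_sum(mics, delays)
print("noise power, single mic :", np.var(mics[0] - voice))
print("noise power, beamformed :", np.var(output - voice))
```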
4. Microphone Array — We’ve mentioned this term a couple of times, but it’s important to define it as a standalone term. A microphone array is a single piece of hardware with multiple individual microphones operating in tandem. Working together, the mics can pick up sound from multiple directions, which improves voice accuracy regardless of background noise, the position of the device, or where the speaker is standing.
5. Automatic Speech Recognition — Often abbreviated as ASR, this is the conversion of spoken language into written text. When you say “Hey Siri” and follow with “…send a text,” you’re watching ASR in action. In other words, speech rec (as it’s sometimes shortened) makes it possible for computers to know what you’re saying.
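If you want to see ASR in action from code, the open-source SpeechRecognition package for Python wraps several recognition engines. The sketch below is only illustrative: the audio file name is a placeholder, and the Google Web Speech backend it calls requires an internet connection.

```python
# Rough ASR sketch using the open-source SpeechRecognition package
# (pip install SpeechRecognition). "command.wav" is a placeholder file.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("command.wav") as source:
    audio = recognizer.record(source)          # read the whole file

try:
    text = recognizer.recognize_google(audio)  # spoken audio -> written text
    print("You said:", text)
except sr.UnknownValueError:
    print("Could not understand the audio")
```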
6. Speaker Recognition — Although easy to confuse with speech recognition, speaker recognition is the specific art of determining who is speaking. This is achieved based on the characteristics of voices and a variety of technologies including Markov models, pattern recognition algorithms, and neural networks (defined below). Another term you might hear related to speaker recognition is “Voice Biometrics,” which describes the technology behind speaker rec. There are two major applications of speaker recognition — (i) verification, which aims to confirm that the speaker is who they claim to be, and (ii) identification, the task of determining an unknown speaker’s identity. According to Wikipedia, “In a sense speaker verification is a 1:1 match where one speaker’s voice is matched to one template (also called a “voice print” or “voice model”) whereas speaker identification is a 1:N match where the voice is compared against N templates.”
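The verification-versus-identification distinction is easy to show in a few lines. The sketch below assumes each utterance has already been turned into a numeric “voice print” (in practice produced by a model such as a neural network); the vectors and the 0.9 threshold are invented for illustration.

```python
import numpy as np

# Verification (1:1) vs. identification (1:N) over "voice print" embeddings.
# The embeddings here are hand-made stand-ins, not real voice features.

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Enrolled templates: one stored voice print per known speaker.
enrolled = {
    "alice": np.array([0.9, 0.1, 0.3]),
    "bob":   np.array([0.2, 0.8, 0.5]),
}
new_print = np.array([0.85, 0.15, 0.35])   # embedding of a new utterance

# Verification: does the new utterance match the claimed speaker?
claimed = "alice"
print("verified:", cosine_similarity(new_print, enrolled[claimed]) > 0.9)

# Identification: which enrolled speaker is the closest match?
best = max(enrolled, key=lambda name: cosine_similarity(new_print, enrolled[name]))
print("identified as:", best)
```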
7. Markov Models — Rooted in probability theory, a Markov model uses randomly changing systems to forecast future states. A great example is the predictive text you’ve probably seen on your iPhone. If you type “I love,” the system can predict the next word to be “you” based on probability. There are four types of Markov models, including hidden Markov models and Markov chains. If you’re interested in learning more, we suggest the Clemson University Intro to Markov Models. Markov models are very important in speech recognition because they mirror how humans process language. The sentences “make the lights red” and “make the lights read” are pronounced the same, but knowing which word is more probable in context helps ensure accurate speech recognition.
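A next-word predictor like the iPhone example is just a Markov chain over words: count which word follows which in some text, then suggest the most probable successor. The tiny training text below is made up.

```python
from collections import Counter, defaultdict

# A tiny Markov-chain next-word predictor: count word-to-word transitions,
# then predict the most probable next word.

training_text = "i love you . i love pizza . i love you ."
transitions = defaultdict(Counter)

words = training_text.split()
for current_word, next_word in zip(words, words[1:]):
    transitions[current_word][next_word] += 1

def predict_next(word):
    followers = transitions[word]
    return followers.most_common(1)[0][0] if followers else None

print(predict_next("love"))   # -> "you" (seen twice vs. "pizza" once)
```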
8. Pattern Recognition — As the name suggests, this is a branch of machine learning that uses patterns and regularities in data to train systems. There’s a lot to pattern rec, with algorithms aiding in classification, clustering, learning, prediction, regression, sequencing, and more. Pattern recognition is very important in speech recognition, where it helps determine which sounds form which words.
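As a toy example of the idea, the sketch below uses the scikit-learn library to learn two clusters of points and classify new ones. The two “features” per sample and the “s”/“sh” labels are invented; in speech they might be acoustic measurements taken from a short audio frame.

```python
# Toy pattern-recognition (classification) example using scikit-learn
# (pip install scikit-learn). The features and labels are made up.
from sklearn.neighbors import KNeighborsClassifier

features = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],   # samples of pattern "s"
            [0.8, 0.9], [0.9, 0.8], [0.85, 0.95]]   # samples of pattern "sh"
labels = ["s", "s", "s", "sh", "sh", "sh"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(features, labels)

print(model.predict([[0.12, 0.18], [0.88, 0.9]]))   # -> ['s' 'sh']
```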
9. Artificial Neural Networks — A computer system modeled on how we believe the human brain works, a neural network uses artificial neurons to learn how to solve problems that typical rule-based systems struggle with. For example, neural networks are essential for facial recognition, self-driving cars, and of course voice control. For a great (if highly technical) article on how neural networks are used in speech recognition, see this post by Andrew Gibiansky.
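At the lowest level, an artificial neuron just weights its inputs, adds a bias, and squashes the result through a nonlinearity; networks of many such units, with weights learned from data, are what drive modern speech recognition. The sketch below only shows a forward pass with made-up weights, not the learning step.

```python
import numpy as np

# Forward pass through a tiny network: one hidden layer of 4 artificial
# neurons, then a single output neuron. Weights are random stand-ins,
# not learned from data.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

inputs = np.array([0.5, 0.8, 0.1])          # e.g. acoustic features of a frame

hidden_weights = np.random.default_rng(0).standard_normal((4, 3))
hidden_bias = np.zeros(4)
output_weights = np.random.default_rng(1).standard_normal(4)
output_bias = 0.0

hidden = sigmoid(hidden_weights @ inputs + hidden_bias)
output = sigmoid(output_weights @ hidden + output_bias)
print("network output:", output)            # a score between 0 and 1
```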
10. Natural Language Processing (NLP) — When a computer can analyze, understand, and derive meaning from human language, it is using Natural Language Processing. NLP covers a range of applications including syntax, semantics, discourse, and speech. Named entity recognition is a good example: tools like Stanford CoreNLP can pick out the people, places, organizations, and dates mentioned in a sentence.
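For a hands-on taste, the snippet below runs named entity recognition with the open-source spaCy library rather than CoreNLP, simply because it is easy to set up (pip install spacy, then python -m spacy download en_core_web_sm).

```python
# Named entity recognition with spaCy; the sentence is just an example.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Alexa, book me a flight from Denver to Boston next Friday.")

for entity in doc.ents:
    print(entity.text, "->", entity.label_)   # e.g. Denver -> GPE, Friday -> DATE
```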
11. Natural Language Understanding (NLU) — is a subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension. This gives the user flexibility when speaking to the system, as it understands the intent. Whether you say, “turn off the lights” in “my room,” “the bedroom,” or “Alex’s room” — you can get the same desired result. NLU focuses on the problem of handling unstructured inputs governed by poorly defined and flexible rules and converts them to a structured form that a machine can understand and act upon. While humans are able to effortlessly handle mispronunciations, swapped words, contractions, colloquialisms, and other quirks, machines are less adept at handling unpredictable inputs. In other words, NLU focuses on the machine’s ability to understand what we say.
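Here’s a deliberately crude sketch of what NLU produces: a structured intent a machine can act on, no matter how the request was phrased. Real systems use trained statistical models rather than hand-written rules; the aliases and keywords below are invented just to show the input and output shapes.

```python
# Toy rule-based "NLU": map flexible phrasings to one structured command.

ROOM_ALIASES = {"my room": "bedroom", "the bedroom": "bedroom", "alex's room": "alex_room"}

def parse(utterance):
    text = utterance.lower()
    action = "turn_off" if "off" in text else "turn_on"
    room = next((canonical for alias, canonical in ROOM_ALIASES.items() if alias in text), None)
    device = "lights" if "light" in text else "unknown"
    return {"intent": action, "device": device, "room": room}

print(parse("Turn off the lights in my room"))
print(parse("Turn off the lights in Alex's room"))
# Different phrasings, same structured result the home system can act on.
```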
12. Anaphora Resolution — is the act of recalling earlier references and properly responding to their associated pronouns. By saying, “Turn on the TV,” and later saying, “Turn it up,” there is an implied understanding that “it” refers to the TV’s volume. This is very important when it comes to natural speech control, particularly in the home.
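A bare-bones way to picture anaphora resolution in the home: remember the most recently mentioned device so that a later “it” can be resolved back to it. Real systems weigh many candidate referents; this sketch tracks only the last one.

```python
# Minimal anaphora-resolution sketch: resolve "it" to the last-mentioned device.

class DialogueState:
    def __init__(self):
        self.last_device = None

    def handle(self, command):
        devices = ["tv", "lights", "music"]
        mentioned = next((d for d in devices if d in command.lower()), None)
        if mentioned:
            self.last_device = mentioned
            return f"Acting on the {mentioned}: {command}"
        if "it" in command.lower().split() and self.last_device:
            return f"Resolved 'it' to the {self.last_device}: {command}"
        return "Sorry, I'm not sure what you mean."

state = DialogueState()
print(state.handle("Turn on the TV"))     # remembers the TV
print(state.handle("Turn it up"))         # "it" -> the TV
```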
13. Compound Commands — is the ability to understand and process multiple commands uttered in a single breath. For example, “Turn off the lights, stop the music, and watch Black Mirror.”
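Processing a compound command starts with splitting the utterance into its individual commands, each of which can then be handled on its own. Real assistants do this with NLU models; splitting on commas and “and,” as in the sketch below, is enough to show the idea.

```python
import re

# Split one compound utterance into separate commands to handle individually.

def split_compound(utterance):
    parts = re.split(r",\s*and\s+|,\s*|\s+and\s+", utterance.strip().rstrip("."))
    return [part.strip() for part in parts if part.strip()]

print(split_compound("Turn off the lights, stop the music, and watch Black Mirror"))
# -> ['Turn off the lights', 'stop the music', 'watch Black Mirror']
```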
14. Virtual Assistant — A software agent that can perform tasks or services for an individual can be referred to as a virtual assistant. For example, a Chatbot is a virtual assistant that is accessed via online chat. By using NLU combined with automatic speech recognition, Alexa, for example, can act as a virtual assistant to complete daily tasks, such as ordering pizza or an Uber.
15. Voice User Interface (VUI) — You’ve probably heard the term GUI (graphical user interface). A VUI is a voice user interface, which describes how a user interacts with a voice assistant. As prevalent as they are becoming, VUIs are not without their challenges. People have little patience for a machine that doesn’t understand, so there is little room for error: a VUI needs to respond to input reliably and fail gracefully when it can’t. Designing a good VUI requires interdisciplinary skills spanning computer science, linguistics, and psychology, along with an in-depth understanding of both the tasks to be performed and the target audience performing them. If designed properly, a VUI requires little or no training and provides a delightful user experience.
16. Wake Word — When you say “Alexa” or “Hey Google,” you’re using a wake word, also known as a hot word or keyword. Typically wake word detection runs on the local device, which is why “always listening” devices need direct power and can’t be battery operated. Once the wake word is heard, the voice assistant is activated and speech is typically processed in the cloud. Wake words need to be fine-tuned in order to work untrained for most users, which is why it’s tough to pick an arbitrary wake word and expect it to work well. That said, when you do train a wake word, hard stop sounds like the “k” in “okay” or the “x” in “Alexa,” as well as multiple syllables, help increase reliability.
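The always-listening loop described above looks roughly like the following schematic: a small detector scores short audio frames on the device, and only after the wake word is spotted does the assistant capture the command and hand it to a cloud recognizer. Every function and frame here is a placeholder, not a real API.

```python
# Schematic wake-word loop: score frames locally, only go to the cloud after
# the wake word fires. All audio and models below are placeholders.

WAKE_THRESHOLD = 0.8

def score_wake_word(frame):
    """Placeholder for a small on-device keyword-spotting model."""
    return 0.95 if frame == "alexa-like audio" else 0.05

def send_to_cloud_asr(audio):
    """Placeholder for streaming the follow-up speech to a cloud ASR service."""
    return "turn on the lights"

# Placeholder stream of short audio frames from the microphone array.
frames = ["silence", "tv noise", "alexa-like audio", "command audio"]

listening_for_command = False
for frame in frames:
    if listening_for_command:
        print("cloud heard:", send_to_cloud_asr(frame))   # processed off-device
        listening_for_command = False
    elif score_wake_word(frame) > WAKE_THRESHOLD:          # detected on-device
        listening_for_command = True
```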
This article was written by Bridget, who is part of the business development team focusing on integrator relations and marketing. Originally from Massachusetts, she worked at Savant for many years before moving to Denver to work for a large integrator, Xssentials. Outside of work Bridget likes to hike, travel, and check out new restaurants.