Voice is becoming a pervasive way to manage and interact with everyday tech devices, spreading from initial adoption in phones and smart speakers to smartwatches, cars, laptops, home appliances and much more.
Cloud platforms get most of the credit for enabling voice assistant services such as Amazon Alexa, Google Assistant or Microsoft Cortana, but that overlooks the growing role edge computing plays in enabling voice interfaces. A substantial amount of processing and analysis happens on the devices themselves to let users interact with them by simply talking.
Voice-enabled devices are not constantly recording audio and sending it to the cloud to determine if someone is giving them an instruction. That would not only be a privacy concern, but also a waste of energy, computing and network resources. Having to send all words to the cloud and back would also introduce latency and slow the responsiveness of the system. Today’s voice interfaces typically use keyword or “wake-word” detection, dedicating a small portion of edge computing resources (i.e. computing done on the device itself or “at the edge”) to process microphone signals while the rest of the system remains idle. This power-efficient approach is particularly important for extending usage time in portable, battery-operated devices such as smartphones and wearables.
When the always-on processing core handling keyword detection, usually a digital signal processor (DSP), finds a match with the expected word (e.g. “Alexa”), it wakes up the rest of the system to support functions requiring more computing power such as audio capture, compression and transmission, language processing and voice tracking.
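The two-stage design described above can be sketched in a few lines of code. This is a minimal illustration only: real wake-word detectors on always-on DSPs use small neural networks or statistical models over spectral features, not the toy template correlation shown here, and the function names and thresholds are hypothetical.

```python
# Toy sketch of an always-on wake-word gate. The low-power core scores
# each audio frame cheaply; only a confident match wakes the main system.

def frame_energy(frame):
    """Average signal energy of one audio frame."""
    return sum(s * s for s in frame) / len(frame)

def wake_word_score(frame, template):
    """Toy similarity score: normalized correlation with a stored template."""
    dot = sum(a * b for a, b in zip(frame, template))
    norm = (sum(a * a for a in frame) * sum(b * b for b in template)) ** 0.5
    return dot / norm if norm else 0.0

def always_on_loop(frames, template, energy_gate=0.01, match_threshold=0.8):
    """Run on the always-on core; return the frame index that wakes the system."""
    for i, frame in enumerate(frames):
        if frame_energy(frame) < energy_gate:
            continue                 # too quiet: skip the more expensive matching
        if wake_word_score(frame, template) >= match_threshold:
            return i                 # wake the main application processor here
    return None                      # no match: the rest of the system stays idle
```

The energy gate mirrors the power-saving idea in the article: most frames are rejected with almost no computation, and the costlier matching step runs only when there is something worth hearing.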
Separating signal from noise
After a keyword is detected, the device starts listening actively. At this point, the ability of the system to accurately interpret voice commands largely depends on how “clean” the voice reception is – which can be a challenge in a noisy environment such as a street, a party or a family room where kids are watching a movie or multiple people are talking at once.
A number of edge computing technologies help separate the user’s voice from other surrounding sounds. For instance, beam forming techniques process audio from multiple microphones in the device to focus listening in the direction where the user is speaking from – like a virtual directional microphone. If the user moves around, voice tracking algorithms running on the device can adjust the balance among signals from the microphones, so the focus follows the voice source.
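The simplest form of the beamforming idea mentioned above is delay-and-sum: because sound from the user's direction reaches each microphone at a slightly different time, shifting each signal to compensate and then averaging reinforces the voice while uncorrelated noise partially cancels. The sketch below assumes integer sample delays for clarity; a real beamformer works with fractional delays and adapts them as the voice-tracking algorithm updates the steering direction.

```python
def delay_and_sum(mic_signals, delays):
    """Minimal delay-and-sum beamformer sketch.

    mic_signals: equal-length lists of samples, one list per microphone.
    delays: for each mic, how many samples later the wavefront arrives
            there than at the reference mic; aligning shifts it forward.
    """
    n = len(mic_signals[0])
    out = []
    for t in range(n):
        acc = 0.0
        for sig, d in zip(mic_signals, delays):
            idx = t + d
            if 0 <= idx < n:         # ignore samples shifted past the buffer edge
                acc += sig[idx]
        out.append(acc / len(mic_signals))
    return out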
Advanced voice-enabled devices also process inputs from the microphone array to suppress environmental noise around the user’s speech, similar to the way noise-cancelling headphones operate. Smart speakers also use on-device echo cancellation technology to allow for “barge-in” capabilities – suppressing music and other speaker sounds from the microphone signals so the smart speaker can receive voice commands even while playing music loudly. So a smart speaker can deliver great sound while still listening, in case you want to change songs or order a pizza mid-track.
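Echo cancellation works because the device knows exactly what it is playing: it can adaptively estimate how that playback signal appears in the microphone and subtract it, leaving the user's voice. The sketch below uses a basic LMS adaptive filter to illustrate the principle; production cancellers use more sophisticated frequency-domain variants, and the tap count and step size here are arbitrary illustrative choices.

```python
def lms_echo_canceller(far_end, mic, taps=4, mu=0.1):
    """Sketch of acoustic echo cancellation with an LMS adaptive filter.

    far_end: samples being sent to the loudspeaker (known to the device).
    mic: samples captured by the microphone (echo plus any user speech).
    Returns the residual signal with the estimated echo subtracted.
    """
    w = [0.0] * taps                      # adaptive estimate of the echo path
    cleaned = []
    for t in range(len(mic)):
        # most recent `taps` loudspeaker samples, zero-padded at the start
        x = [far_end[t - k] if t - k >= 0 else 0.0 for k in range(taps)]
        echo_est = sum(wk * xk for wk, xk in zip(w, x))
        e = mic[t] - echo_est             # residual = voice + cancellation error
        cleaned.append(e)
        # LMS update: nudge the weights toward the true echo path
        for k in range(taps):
            w[k] += mu * e * x[k]
    return cleaned
```

As the filter converges, the residual contains less and less of the music and mostly the user's barge-in speech, which is then passed on for keyword and command recognition.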
On-device artificial intelligence
Increasing edge computing capabilities in voice-enabled gadgets also support innovative features using on-device artificial intelligence (AI). Offline commands, for example, allow on-device language processing and execution of basic voice instructions when an internet connection is not available. This feature is already widely available in smartphones, helping users set alarms and reminders even if the device is in airplane mode or out of coverage. Offline commands are particularly valuable in smart home settings – letting users turn lights on and off, change the thermostat temperature or disable the home security alarm even during an internet service outage.
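At its simplest, offline command handling means the device keeps a small local grammar of phrases it can act on without any cloud round trip. The table and function names below are hypothetical, purely to illustrate the pattern; real systems use on-device speech models rather than keyword sets.

```python
# Hypothetical local grammar: the small set of commands the device can
# recognize and execute entirely on its own, with no network connection.
OFFLINE_COMMANDS = {
    ("turn", "on", "lights"): "lights_on",
    ("turn", "off", "lights"): "lights_off",
    ("set", "alarm"): "set_alarm",
}

def match_offline_command(transcript):
    """Return the local action whose keywords all appear in the transcript,
    or None so the request can be deferred until connectivity returns."""
    words = set(transcript.lower().split())
    for keywords, action in OFFLINE_COMMANDS.items():
        if all(k in words for k in keywords):
            return action
    return None
```

Anything that matches runs locally and instantly; anything that does not can be queued for the cloud, which is why basic controls keep working through an internet outage.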
Devices with advanced edge computing power can also perform voice biometrics for user authentication. This capability prevents unauthorized users from making purchases or changing key settings using voice commands – so children don’t keep adding items to the shopping list or burglars can’t disable the house alarm by shouting at the smart speaker when the owners are on vacation.
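A common approach to voice biometrics is to reduce an utterance to a compact numerical "voiceprint" (an embedding) and compare it against the enrolled owner's print with a similarity measure. The sketch below assumes such embeddings already exist and uses cosine similarity with an arbitrary threshold; how the embeddings are computed, and the threshold itself, are placeholders rather than any specific vendor's method.

```python
def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def is_authorized(enrolled_print, sample_print, threshold=0.9):
    """Accept a sensitive voice command only if the speaker's voiceprint
    is close enough to the enrolled owner's voiceprint."""
    return cosine_similarity(enrolled_print, sample_print) >= threshold
```

Because the comparison runs on the device, the check adds no network latency and the voiceprint never has to leave the home.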
On-device AI can also support audio classification for uses beyond voice commands. A home security device can be trained to detect the sound of glass breaking and trigger an alarm, or a smart baby monitor can detect when a baby is crying and notify parents. Coupled with cameras, sound allows machine learning to put more context around people or events. As AI capabilities increase, we’ll see other very interesting applications. In the case of home security, for example, the ability to analyze events on the device itself reduces the amount of data sent to the cloud to just critical alerts, increasing the speed and convenience of the whole system.
The demand for superior edge processing power in voice-enabled devices is driving adoption of heterogeneous computing architectures, which integrate diverse engines such as CPUs, GPUs and DSPs into a single system-on-chip and assign each workload to the most efficient compute engine, improving performance, power efficiency and cost-effectiveness across the wide array of devices embracing a voice interface.
While most of our interactions today are with phones and popular smart speakers from Amazon or Google, as edge computing and AI become more powerful and prevalent we will see many more form factors, with voice interfaces added to virtually anything, whether it is a router, an appliance or a lamp. They will still rely on the power of the cloud ecosystem behind them, but the devices themselves will also be a lot smarter and able to conduct many operations locally – making them more responsive and convenient while saving time and reducing the amount of data transferred.
This article is published as part of the IDG Contributor Network.