January 19, 2021
By Peter Bartolik
The ease with which we use digital assistants such as Apple’s Siri and Amazon’s Alexa illustrates the potential of natural language processing (NLP) to create better human-to-computer interfaces. But the computers, smartphones and smart speakers on which we access those assistants exploit powerful processors, generous memory, and, generally, cloud connectivity that may not be available to many edge devices. As a result, scientists and developers are striving for more resource-efficient ways to process the spoken and written word.
Most humans communicate primarily through speech and text, and efforts to process that communication computationally date back to the 1940s. Turing’s seminal 1950 paper on testing what we now call artificial intelligence (AI) was founded in large part on whether humans could converse textually with a computer without realizing they were conversing with a machine. And in 1954, an IBM computer “within a few seconds” translated more than 60 Russian sentences “into easily readable English.”
By 2011, IBM’s Watson demonstrated its ability “to understand the actual meaning behind words, distinguish between relevant and irrelevant content, and ultimately demonstrate confidence to deliver precise final answers” by competing against humans in the “Jeopardy” quiz show.
Advances in machine learning, speech recognition technology, and access to ever-larger compute resources have advanced NLP to the point where in 2020 an estimated 128 million people in the U.S. used a voice assistant at least monthly and Google claimed a half billion users worldwide for its Google Assistant.
Connectivity Vs. Cost and Latency
Whether accessed by voice or text command, digital assistants can execute a range of functions, from dialing a call to roaming through online databases to extract needed information. But such assistants are of little use when embedded in edge devices with limited resources and connectivity, which is why developers are working to run NLP on the devices themselves. Embedded devices aren’t encumbered by the latency of cloud processing, and developers don’t have to bear the costs of running cloud instances or paying for cloud-based speech recognition services.
Kadho, developer of the KidSense.ai engine for children’s automatic speech recognition in multiple languages, asserted that “speech recognition API [application programming interface] calls usually cost from $4 for 1,000 API calls or approximately $.024 per minute of audio input.” The company was subsequently acquired in early 2020 by ROYBI and incorporated into the ROYBI Robot AI-powered smart learning tool.
The KidSense Edge engine performs automatic speech recognition offline, with high accuracy and little latency. Addressing the challenges and tradeoffs of embedded NLP, ROYBI CTO Ron Cheng said via an email dialogue that “the most challenging is the computing capacity and power consumption.”
Because the product is portable, it requires a low-power consumption chip to continuously enable voice and face recognition. In addition, the Smart AI engine can narrow down the dictionary automatically based on a child’s selection of specified topics and lessons, reducing the need for local compute and storage.
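The dictionary-narrowing idea can be sketched in a few lines. This is an illustrative example only, not KidSense internals; the topic names and word lists are hypothetical. The point is that restricting the recognizer’s active lexicon to the words a selected lesson can produce shrinks the search space the device must score:

```python
# Hypothetical sketch of topic-based vocabulary narrowing: the
# recognizer only scores words belonging to the child's selected
# topics, falling back to the full lexicon if no topic matches.

FULL_LEXICON = {
    "apple", "banana", "cat", "dog", "red", "blue", "one", "two",
    "hello", "goodbye", "circle", "square",
}

TOPIC_WORDS = {
    "animals": {"cat", "dog"},
    "colors": {"red", "blue"},
    "numbers": {"one", "two"},
}

def active_lexicon(selected_topics):
    """Return the reduced word set the recognizer must score."""
    words = set()
    for topic in selected_topics:
        words |= TOPIC_WORDS.get(topic, set())
    # Fall back to the full lexicon if nothing matched.
    return words & FULL_LEXICON if words else FULL_LEXICON

lex = active_lexicon(["animals", "colors"])
print(sorted(lex))  # ['blue', 'cat', 'dog', 'red']
```

A real system would apply the same restriction to the acoustic model’s decoding graph, but even this toy version shows why a narrower dictionary cuts local compute and storage.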
“There’s definitely incentive to do more locally,” said Dan Miller, lead analyst and founder of Opus Research, which focuses on the merging of intelligent assistant technologies, conversational intelligence, intelligent authentication, enterprise collaboration and digital commerce. “We’ve overcome a lot of the old challenges in just pure speech recognition, so you can get accurate recognition and you can get some human, less robotic responses, and you can do it locally.”
Jumpstarting NLP at the Edge
Companies such as Sensory and Expert.ai are providing developers with free tools to jumpstart efforts to create edge NLP applications. “The typical challenges in the embedded environment are memory size, processor load and speed, and those are the dials that you turn to create an optimal user experience in an embedded platform,” said Joe Murphy, a vice president with Sensory, a pioneer of neural network approaches for embedded speech recognition in consumer electronics that was founded in 1994.
Companies in the consumer electronics market are trying to squeeze in more and more embedded AI features with fewer and fewer resources, he said, adding, “Doing more with less is basically what AI on the edge is tasked with.” One way to enable more features is to use specialized neural network accelerators for embedded devices.
“Many people assume there’ll be less accuracy with a speech recognition model that is embedded on a device versus one that is in the cloud, but in our experience that’s not the case,” said Sensory’s Murphy. “We create what we’ll call a specialist or a domain-specific assistant.” He contrasted that specialized, embedded NLP model to a cloud-based generalist assistant that lacks domain-specific context to understand a series of related commands such as “Start cooking” and “Stop.”
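How a domain-specific assistant resolves a bare “Stop” from context can be sketched simply. The class and command names below are hypothetical, not Sensory’s implementation; the idea is that the device remembers the last activity it started, so a follow-up command needs no cloud round-trip or general-purpose language model:

```python
# Illustrative sketch of a domain-specific (appliance) assistant:
# a small command grammar plus local state lets "stop" be resolved
# against whatever activity was most recently started.

class ApplianceAssistant:
    COMMANDS = {"start cooking": ("cook", "start"),
                "start timer": ("timer", "start")}

    def __init__(self):
        self.active = None  # last activity started, if any

    def handle(self, utterance):
        text = utterance.lower().strip()
        if text in self.COMMANDS:
            self.active, action = self.COMMANDS[text]
            return f"{action} {self.active}"
        if text == "stop" and self.active:
            stopped, self.active = self.active, None
            return f"stop {stopped}"
        return "unknown"

oven = ApplianceAssistant()
print(oven.handle("Start cooking"))  # start cook
print(oven.handle("Stop"))           # stop cook
```

A cloud generalist would need much more machinery to disambiguate “Stop”; on a single-purpose device, the domain itself supplies the context.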
Developers have also become more efficient at developing edge NLP models and fine-tuning them for specific needs. Sensory, for example, is able to build voice NLP models based on computer-generated data and to reduce reliance on the more time-consuming and costly process of physically recording large numbers of people speaking words and phrases in different dialects and languages.
NLP relies on large corpora of data to train deep learning neural networks, which can be “gargantuan,” said Alexander Wong, the Canada Research Chair in Artificial Intelligence and Medical Imaging and an associate professor in the Department of Systems Design Engineering at the University of Waterloo, as well as co-founder of DarwinAI. He and other researchers “have been looking at ways to greatly reduce the complexity of such deep neural networks so that you can actually have everything running on the edge locally on your device.”
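One common family of complexity-reduction techniques is weight pruning. The sketch below shows magnitude-based pruning, zeroing the smallest weights so a network can be stored and executed more cheaply on an edge device; it is a generic illustration under stated assumptions, not a description of DarwinAI’s methods:

```python
# Minimal sketch of magnitude-based weight pruning: keep only the
# largest-magnitude weights, zeroing the rest. Sparse weights can
# then be compressed and skipped at inference time on the edge.
import numpy as np

def prune_by_magnitude(weights, sparsity=0.9):
    """Zero the smallest `sparsity` fraction of weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
pw = prune_by_magnitude(w, sparsity=0.9)
print(f"nonzero fraction: {np.count_nonzero(pw) / pw.size:.2f}")  # 0.10
```

In practice pruning is paired with fine-tuning to recover accuracy, and with quantization to shrink each remaining weight, but the core trade is the one shown: fewer parameters to store and multiply.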
NLP has already made great strides in sifting through large volumes of text to understand the overall structure of the data, interpret it and even infer human intent or emotion. Speech adds additional compute complexity as, according to Wong, algorithms first have to decompose a graphical representation of sound waves to understand what a person is saying and reconstruct a text that can be interpreted.
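The first decomposition step Wong describes, turning a sound wave into a time-frequency picture an acoustic model can decode, can be sketched as a plain magnitude spectrogram. The frame and hop sizes below are typical values for 16 kHz speech, assumed for illustration rather than drawn from any particular product:

```python
# Hedged sketch of speech front-end processing: slice the waveform
# into overlapping windowed frames and take the magnitude FFT of
# each, yielding a spectrogram for a downstream acoustic model.
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram: |FFT| of overlapping Hann-windowed frames."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (num_frames, frame_len//2 + 1)

# One second of a 440 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 201)
```

Real recognizers add mel filtering and log compression on top of this, then hand the result to a neural network; each stage adds the extra compute that makes speech heavier than text for edge devices.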
It may be a while before edge NLP progresses to the point where devices can engage in conversation with humans that lives up to the Turing test, but Wong said it would be a mistake for developers and businesses to take a wait-and-see posture.
“We’re making huge progress,” he said. “The industries that see the most impact, of course, are consumer electronics and big tech, but it’s to the point where a lot of different industries from healthcare to manufacturing to automotive are paying serious attention.”
With NLP, as with almost every other technology these days, the pace of change has accelerated. Applying it at the edge is pushing science and business toward more efficient and cost-effective approaches, to the point where NLP can readily pull meaning out of the dialogue between human and machine and react accordingly in real time.