Deciding on a far-field voice recognition DSP

November 14, 2017 | By Leor Grebler
Hardware makers are now looking to bring the next interface to their products - voice. But the far-field use case is demanding, with many engineering trade-offs to consider.

Creating a delightful experience for the user means overcoming several challenges that can block reliable interaction. Reverberation, ambient noise, and focusing the listening on the person speaking were once insurmountable problems. Today, a handful of turnkey technology providers offer far-field digital signal processing solutions, and more are being introduced, driving down costs. Hardware makers are still left with the daunting task of deciding what types of microphones to use, how many, where to place them, and a number of other factors that determine whether far-field voice interaction works well.

While there are many moving parts to voice interaction that can make or break an experience, such as latency, the accuracy of the natural language processing, and the quality of the synthesized speech response, one of the most critical pieces is accurate speech recognition.

Far-field DSPs aim to improve the accuracy of speech recognition in three ways:

  1. De-reverb. Sound waves reflecting off objects and surfaces in a room arrive at the microphone at different times, creating an echo or reverberation. The farther a device is from the person speaking, the more likely the microphones are to pick up an echo. Multi-microphone far-field solutions look for a match between the different sound signals and, knowing the distance between the microphones, can buffer, merge, or simply ignore echoing signals. These echoes tend to confuse speech recognition: on a spectrogram, reverberant speech appears smudged, making it hard for systems to identify words and sounds. Some single-mic DSP solutions attempt de-reverb through software algorithms rather than dedicated hardware processing. (A minimal single-reflection sketch appears after this list.)
  2. Voice Activity Detection and Automatic Gain Control. Essentially, this filter runs on the DSP and listens for something it can identify as human voice. It then increases the gain (volume) of the audio signal and tries to ignore the rest of the noise in the environment. This yields a better signal-to-noise ratio, making it easier for speech recognizers to identify sounds and words. Actual implementations of these filters range from simple to extremely complex, which is one reason DSP code tends to run on specialized processors. (A toy VAD/AGC sketch follows this list.)
  3. Beamforming. Beamforming attempts to identify the direction from which a voice signal is arriving and then to ignore signals from all other directions. The result is both a reduction in echo and an increase in signal-to-noise ratio. Typically, this algorithm requires at least two microphones and can be made more accurate with an array of microphones in different orientations. The narrowness of the beam is also determined by the number of microphones, with more microphones producing a narrower, more selective beam. (A two-mic delay-and-sum sketch follows this list.)
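To make the de-reverb idea concrete, here is a minimal sketch in Python (NumPy/SciPy) that models a room as a single delayed, attenuated reflection and then inverts that model exactly. Real rooms produce dense trails of reflections and real de-reverb filters are adaptive, so everything here (the 16 kHz sample rate, the 30 ms reflection, the delay and gain being known in advance) is an illustrative assumption, not a production algorithm.

```python
import numpy as np
from scipy.signal import lfilter

FS = 16_000  # assumed sample rate, Hz

def add_reflection(dry, delay_s, gain):
    """Toy room model: wet[n] = dry[n] + gain * dry[n - d] (one reflection)."""
    d = int(delay_s * FS)
    wet = dry.copy()
    wet[d:] += gain * dry[:-d]
    return wet

def remove_reflection(wet, delay_s, gain):
    """Exact inverse of the one-reflection model: y[n] = wet[n] - gain * y[n - d],
    implemented as an IIR filter with denominator 1 + gain * z^-d."""
    d = int(delay_s * FS)
    a = np.zeros(d + 1)
    a[0], a[d] = 1.0, gain
    return lfilter([1.0], a, wet)

# Demo: a reflection arriving 30 ms late at half amplitude is removed exactly.
rng = np.random.default_rng(0)
dry = rng.standard_normal(FS)  # one second of speech-like noise
wet = add_reflection(dry, delay_s=0.03, gain=0.5)
clean = remove_reflection(wet, delay_s=0.03, gain=0.5)
assert np.allclose(clean, dry)
```

In practice, multi-microphone systems estimate these delays from the signals themselves (cross-correlation between channels is one common approach) rather than knowing them ahead of time.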
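The next sketch, under the same assumptions (Python, 16 kHz audio), shows the basic shape of frame-based VAD plus gain control: a fixed energy threshold stands in for voice detection, and the gain is nudged toward a target level only on frames flagged as voice. Production VADs use spectral or machine-learned features and adaptive thresholds; the constants here are illustrative.

```python
import numpy as np

FS = 16_000           # assumed sample rate, Hz
FRAME = FS // 50      # 20 ms frames
TARGET_RMS = 0.1      # illustrative target level for voiced frames
VAD_THRESHOLD = 0.01  # illustrative fixed energy threshold

def vad_agc(signal, smoothing=0.9):
    """Frame-based energy VAD with a slowly adapting gain on voiced frames."""
    out = signal.copy()
    gain = 1.0
    for i in range(0, len(signal) - FRAME + 1, FRAME):
        frame = signal[i:i + FRAME]
        rms = np.sqrt(np.mean(frame ** 2))
        if rms > VAD_THRESHOLD:                  # crude "is this voice?" test
            desired = TARGET_RMS / (rms + 1e-9)  # gain that would hit the target
            gain = smoothing * gain + (1 - smoothing) * desired
        out[i:i + FRAME] = frame * gain          # noise-only frames keep the old gain
    return out
```

Smoothing the gain rather than jumping straight to the desired value is what keeps the output from pumping audibly between frames.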
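Finally, a minimal two-microphone delay-and-sum beamformer, again with an assumed geometry (5 cm mic spacing) and sample rate. Advancing one channel so that a wavefront from the chosen direction lines up makes speech from that direction add coherently while off-axis sound adds incoherently, which for two mics buys roughly 3 dB of SNR against uncorrelated noise. Larger arrays apply the same idea across more channels for a narrower beam.

```python
import numpy as np

FS = 16_000             # assumed sample rate, Hz
SPEED_OF_SOUND = 343.0  # m/s
MIC_SPACING = 0.05      # assumed distance between the two mics, metres

def steering_delay(angle_deg):
    """Samples by which mic_far lags mic_near for a source at angle_deg
    from broadside (0 degrees = directly in front of the pair)."""
    tau = MIC_SPACING * np.sin(np.radians(angle_deg)) / SPEED_OF_SOUND
    return int(round(tau * FS))

def shift(x, n):
    """Shift x by n samples with zero padding (n > 0 delays, n < 0 advances)."""
    out = np.zeros_like(x)
    if n >= 0:
        out[n:] = x[:len(x) - n] if n else x
    else:
        out[:n] = x[-n:]
    return out

def delay_and_sum(mic_near, mic_far, angle_deg):
    """Advance the far channel so the target direction adds coherently,
    then average the two channels."""
    d = steering_delay(angle_deg)
    return 0.5 * (mic_near + shift(mic_far, -d))
```

Note that at a 16 kHz sample rate and 5 cm spacing the maximum inter-mic delay is only about two samples, which is why practical beamformers work with fractional delays or per-frequency phase shifts rather than whole-sample shifts; the integer version above just keeps the sketch short.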

Next: Three microphone options