Speech Technology
To analyze the majority of business interactions, technology must be able to understand the meaning of voice interactions, whether live or recorded. This process is affected by a variety of factors: the speaker's language, dialect or accent, as well as background noise or interference from the type of device used, such as a mobile phone. Because speech and language vary so widely, legacy approaches such as phoneme matching and word spotting are not enough on their own to determine what is actually being said. Autonomy's speech analytics delivers sophisticated audio recognition and analysis technology that processes spoken interactions based on their conceptual content, not just the way they sound.
Speech technology has gone through several phases of innovation, each one addressing the shortcomings of the previous methods. Many people remember using speech technology over the phone through some type of interactive voice response (IVR) system. These systems were able to recognize a limited number of keywords, such as "Yes" and "No" or the number "5." If more than one word was spoken, these systems needed a pause of silence between the words to tell them apart. Unfortunately, conversational speech does not naturally contain these pauses.
The next evolution in speech technology was phonetic indexing. It was a fast way of finding matches because it looked only for base phonemes and specific groupings of phonemes. Unfortunately, it produced many false positives, because one word can appear inside another, such as "cat" in "catastrophe." It was also sensitive to background noise, to the bandwidth of the call and, in particular, to accents, because the phonemes produced will differ.
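The false-positive problem described above can be sketched with a toy phonetic index. This is a minimal illustration, not how any particular product works: the phoneme symbols are rough ARPAbet-style transcriptions written by hand, and the names LEXICON, to_phonemes and find_phoneme_matches are hypothetical. The longer word used here is "catalog", since in this simplified transcription it begins with the same three phonemes as "cat".

```python
# Toy phonetic index: an utterance is reduced to a flat stream of phonemes.
# ASSUMPTION: these transcriptions are hand-written approximations,
# not the output of a real recognizer or pronunciation dictionary.
LEXICON = {
    "cat": ["K", "AE", "T"],
    "catalog": ["K", "AE", "T", "AH", "L", "AO", "G"],
    "the": ["DH", "AH"],
    "is": ["IH", "Z"],
    "new": ["N", "UW"],
    "here": ["HH", "IY", "R"],
}


def to_phonemes(words):
    """Flatten words into one phoneme stream, discarding word boundaries."""
    stream = []
    for word in words:
        stream.extend(LEXICON[word])
    return stream


def find_phoneme_matches(stream, query):
    """Return the start offsets where the query phonemes occur in the stream."""
    n = len(query)
    return [i for i in range(len(stream) - n + 1) if stream[i:i + n] == query]


# Nobody says "cat" in this utterance, yet the search still reports a hit
# because the phonemes of "cat" open the word "catalog".
utterance = ["the", "new", "catalog", "is", "here"]
hits = find_phoneme_matches(to_phonemes(utterance), LEXICON["cat"])
print(hits)
```

Because the stream keeps no word boundaries and no notion of meaning, the match inside "catalog" is indistinguishable from a genuine occurrence of "cat".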
A language model approach was developed to address the issues presented by phonetic indexing. This technique used a dictionary and a predefined language model, achieved a highly accurate recognition rate and could also find phrases. However, the language model could not distinguish between homophones (e.g. "eye" and "I") or heteronyms (e.g. "the bass player ate bass") and was confined to the preset dictionary. To make the language model more flexible, a self-learning language model was developed that could learn new words. This improved model was highly accurate and could be set up without a dictionary, but it demanded massive computational resources.
Today, the latest generation of speech technology delivers conceptual search. This approach utilizes advanced mathematics and complex algorithms to derive meaning from speech. Conceptual search addresses the shortcomings of previous speech technology models and provides the most accurate way of recognizing and finding speech because it understands what is being said. It can distinguish between homophones and heteronyms, find and group content by concept, and find related information based on meaning. It also has lower computational needs.
Phoneme Processing
Phonemes are the smallest discrete sound units of language and form the basic components of any word. Phoneme matching attempts to break words down into their constituent phonemes and then match search terms to combinations of phonemes as they occur in the audio stream. While this approach does not require a dictionary, it is limited in accuracy and cannot make conceptual matches. Phoneme processing is a commonly used approach to audio recognition, but it is frequently inaccurate and often returns high levels of false positives.
Because words are treated simply as combinations of sounds with no awareness of their meaning in context, the system cannot differentiate between the required data, homophones, and phrases that share the same phonemes but bear no conceptual relation to the search terms. For example, the phrase "recognize speech" in "The computer can recognize speech" contains nearly the same basic phonemes as "wreck a nice beach" in "The oil spill will wreck a nice beach," yet the meaning is entirely different. Phoneme processing also cannot account for multiple expressions of the same concept, so any information that is related to the search term but does not contain the same phonemes will not be retrieved.
As with phoneme matching, word spotting techniques search for words out of context, so they are unable to differentiate between homophones and homonyms. Because the system relies on exact sound matches, it is also unable to account for changes in pronunciation that affect the sound but not the concept behind spoken words, such as plurals. As with other purely phonetic approaches, keyword spotting cannot make conceptual associations and will frequently miss related information that is not included in the search terms.
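The limits of exact-match keyword spotting can be shown with a small sketch. It rests on a loud assumption: text tokens stand in for the sound patterns a real word spotter would match in audio, and the function name spot_keyword and the sample sentences are invented for illustration only.

```python
def spot_keyword(transcript, keyword):
    """Return positions of tokens that exactly equal the keyword.

    ASSUMPTION: plain text tokens stand in for the exact sound matches
    that a real word-spotting system would perform on audio.
    """
    tokens = [token.strip(".,!?") for token in transcript.lower().split()]
    return [i for i, token in enumerate(tokens) if token == keyword.lower()]


# Three hypothetical customer utterances that all concern the same concept.
calls = [
    "I want a refund for this order",
    "Do you give refunds on sale items",
    "I would like my money back please",
]

results = [spot_keyword(call, "refund") for call in calls]
for call, positions in zip(calls, results):
    print(positions, call)
```

Only the first utterance matches. The plural "refunds" and the paraphrase "money back" express the same idea but are missed, because an exact match has no notion of the concept behind the search term.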