What is Automatic Speech Recognition? | by Engati | Jul, 2021


Speech recognition is the use of computers to recognize human speech and convert it into text. It is also referred to as speech-to-text, because it converts spoken language into a text-based format. The field draws on:

  • Linguistics
  • Computer Science
  • Electrical Engineering, etc.

This technology lets machines interpret the human voice and turn it into other forms of output, such as:

  • Text (speech to text)
  • Commands that trigger a process
  • Identification of a user from saved voice segments, etc.

Automatic speech recognition (ASR) combined with interactive voice response (IVR) lets users speak their responses instead of typing them or pressing buttons on their phones.

Speech recognition systems fall into two main categories:

  • Speaker Dependent.
  • Speaker Independent.

Speaker-dependent systems need to be trained before use, a step sometimes called enrollment. The process is simple: the speaker reads a text or a series of isolated vocabulary words into the system, which then processes the recordings and associates them with text libraries. Systems that do not rely on this voice training are called speaker-independent systems.
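
As a rough illustration of enrollment, the sketch below stores one recording per word and later picks the closest stored recording. It is only a toy under stated assumptions: the class name, the three-number "feature vectors", and the nearest-match rule are all hypothetical stand-ins, not how any particular product works.

```python
import math


class SpeakerDependentRecognizer:
    """Toy recognizer that must be enrolled by one speaker before use."""

    def __init__(self):
        # Maps each enrolled word to the feature vector recorded for it.
        self.templates = {}

    def enroll(self, word, features):
        """Store the speaker's recording (as features) under its known text."""
        self.templates[word] = list(features)

    def recognize(self, features):
        """Return the enrolled word whose template is closest to the input."""
        if not self.templates:
            raise ValueError("no enrollment data: train the system first")
        return min(
            self.templates,
            key=lambda word: math.dist(self.templates[word], features),
        )


# Hypothetical feature vectors; real systems derive these from audio.
recognizer = SpeakerDependentRecognizer()
recognizer.enroll("yes", [0.9, 0.1, 0.3])
recognizer.enroll("no", [0.2, 0.8, 0.5])
result = recognizer.recognize([0.85, 0.15, 0.25])
print(result)
```

A speaker-independent system would skip the `enroll` step and ship with models built from many voices.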

The basic sequence of events in automatic speech recognition software is as follows:

  • As you speak, the software records your voice via an audio feed.
  • It then creates a file of the words you spoke into the device.
  • Next, it cleans the file by removing unwanted background noise and normalizing the volume.
  • The cleaned audio is broken down into phonemes, which are the basic building-block sounds of language and words.
  • The ASR software then uses statistical analysis to deduce words from the phonemes and assemble them into complete sentences.
  • Once this process is complete, the ASR can understand the whole conversation and respond to you in a meaningful manner.
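
The cleaning step in the list above can be sketched in a few lines. This is a deliberately crude illustration: the function names and the threshold value are assumptions, and production systems use far more sophisticated noise reduction than a simple gate.

```python
def noise_gate(samples, threshold=0.02):
    """Silence samples quieter than the threshold (crude noise removal)."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def normalize_volume(samples, target_peak=1.0):
    """Scale the samples so that the loudest one reaches target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Nothing but silence: there is nothing to scale.
        return list(samples)
    scale = target_peak / peak
    return [s * scale for s in samples]


# Hypothetical audio samples in the range -1.0 to 1.0.
raw = [0.01, -0.25, 0.5, -0.005, 0.1]
cleaned = normalize_volume(noise_gate(raw))
print(cleaned)
```

The two very quiet samples are zeroed out and the rest are scaled so that the loudest sample sits at full volume.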

The two main types of ASR software are:

  • Directed Dialogue.
  • Natural Language Conversations.

Directed Dialogue:

Directed dialogue is the simpler form of automatic speech recognition. It consists of a machine interface that asks a series of yes/no-type questions and accepts only a very limited set of responses.

It can be found in automated telephone banking and other common customer service interfaces.
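
A directed dialogue can be pictured as a fixed list of questions with a tiny set of accepted answers. The sketch below assumes the speech has already been transcribed; the word lists, function names, and questions are hypothetical.

```python
YES_WORDS = {"yes", "yeah", "yep"}
NO_WORDS = {"no", "nope"}


def interpret(transcript):
    """Map a recognized utterance to True/False, or None if not understood."""
    word = transcript.strip().lower()
    if word in YES_WORDS:
        return True
    if word in NO_WORDS:
        return False
    return None


def run_dialogue(questions, answers):
    """Pair each question with the interpreted answer, in order."""
    return {q: interpret(a) for q, a in zip(questions, answers)}


questions = ["Check your balance?", "Hear recent transactions?"]
outcome = run_dialogue(questions, ["Yes", "maybe later"])
print(outcome)
```

Anything outside the accepted words comes back as `None`, which is where a real interface would typically repeat the question.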

Natural Language Conversations:

Natural language conversations are the more complex and advanced form of ASR. Instead of restricting you to a limited set of words, natural language processing (NLP) tries to simulate an actual conversation, so you can speak to the system in an open-ended way. You will find this in popular virtual assistants such as:

  • Alexa
  • Google Assistant
  • Siri
  • Microsoft Cortana
  • Bixby, etc.

NLP matters far more than directed dialogue for the future development of ASR technology, because it works in a way that simulates human conversation.

The vocabulary of an NLP-based ASR system consists of more than 60,000 words, which gives it over 200 trillion possible word combinations.
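
The article does not say how the 200 trillion figure is derived. One reading that matches it, offered here purely as an assumption, is to count ordered three-word sequences drawn from a 60,000-word vocabulary, with repeats allowed. The helper name below is hypothetical.

```python
VOCABULARY_SIZE = 60_000  # "more than 60 thousand words" from the text


def ordered_sequences(vocabulary_size, length):
    """Count ordered word sequences of the given length, repeats allowed."""
    return vocabulary_size ** length


# Assumption: the article's figure refers to three-word sequences.
for length in (1, 2, 3):
    print(length, f"{ordered_sequences(VOCABULARY_SIZE, length):,}")
```

Under that assumption the three-word count is 216 trillion, which is consistent with "over 200 trillion" and shows how quickly the search space grows with each extra word.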

This huge number of potential combinations makes it impractical for an NLP-based ASR system to scan its whole vocabulary and process each word individually. Instead, natural language systems are programmed to react to selected, tagged keywords that give context to longer requests.

Contextual clues help the system quickly narrow down what you are saying and find the right response.

For example, if you ask “What time is it?” or “What day is it?”, the NLP system focuses on the keywords “time” and “day” to find the right response.
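
The time/day example can be sketched as a lookup of tagged keywords. This is a minimal illustration that assumes the utterance has already been transcribed to text; the `HANDLERS` table and `respond` function are hypothetical names, and a real assistant would use far richer language understanding.

```python
from datetime import datetime

# Hypothetical keyword-to-handler table; a real assistant would have many more.
HANDLERS = {
    "time": lambda now: now.strftime("%H:%M"),
    "day": lambda now: now.strftime("%A"),
}


def respond(utterance, now):
    """Answer using the first tagged keyword found in the utterance."""
    words = utterance.lower().replace("?", " ").replace(",", " ").split()
    for word in words:
        if word in HANDLERS:
            return HANDLERS[word](now)
    return "Sorry, I did not understand that."


# A fixed moment is passed in so the example is reproducible.
moment = datetime(2021, 7, 1, 14, 30)
print(respond("What time is it?", moment))
print(respond("What day is it?", moment))
```

The system never parses the full sentence: it ignores "what", "is", and "it" and reacts only to the keyword that carries the context.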

NLP-based ASR improves through two main mechanisms: human tuning and active learning. Human tuning is the simpler of the two. It involves adding commonly used phrases that the system heard in conversations but that were not initially in its vocabulary. The whole task is done manually, with conversation logs fed back into the ASR software. This expands the system’s comprehension of speech so that it can keep answering new questions.

Active learning is the second, more sophisticated mechanism, and is usually used with NLP versions of speech recognition technology. Unlike human tuning, it does not wait for manual updates: the software is programmed to keep learning from its previous conversations, adopting new words and expanding its vocabulary continuously.
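
The difference between the two mechanisms, in the sense this article uses the terms, can be sketched as follows. Everything here is illustrative: the `Vocabulary` class, its method names, and the `adopt_after` threshold are assumptions, not a description of any real ASR product.

```python
from collections import Counter


class Vocabulary:
    """Toy vocabulary that can grow manually or automatically."""

    def __init__(self, words, adopt_after=2):
        self.known = set(words)
        self.unknown_counts = Counter()
        # Hypothetical rule: adopt a word after hearing it this many times.
        self.adopt_after = adopt_after

    def human_tune(self, reviewed_words):
        """Human tuning: a person adds words picked out of conversation logs."""
        self.known.update(w.lower() for w in reviewed_words)

    def observe(self, utterance):
        """Active learning: adopt an unknown word once it is heard often enough."""
        for word in utterance.lower().split():
            if word in self.known:
                continue
            self.unknown_counts[word] += 1
            if self.unknown_counts[word] >= self.adopt_after:
                self.known.add(word)
                del self.unknown_counts[word]


vocab = Vocabulary({"what", "time", "is", "it"})
vocab.human_tune(["day"])            # added by hand
vocab.observe("what podcast is it")  # "podcast" heard once
vocab.observe("play podcast")        # "podcast" heard twice, so it is adopted
print(sorted(vocab.known))
```

Here "day" enters the vocabulary only because a person added it, while "podcast" is picked up automatically after repeated use; "play" has been heard once and is still pending.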

Such software can also pick up on individual speech habits and communicate more effectively. In other words, it starts learning how its users behave and gives them a personalized experience based on their preferences.
