HOW DO SPEECH RECOGNITION & NATURAL LANGUAGE PROCESSING WORK?
TYPES OF SPEECH RECOGNITION SYSTEMS
REAL-WORLD SPEECH RECOGNITION APPLICATIONS
Getting computers to understand normal speech has been a research goal for many years. Today, speech recognition can speed up information handling in almost any environment. This document covers speech recognition technology and provides details on how to use it in your target environment.
We thank our GOD, who helped us in every field of life to achieve our objectives. We would like to thank our respected teacher Mr. Sajjad Ali, who guided and helped us at every step toward the completion of this project. We would like to thank Sun Corporation and Mr. Aldebaro Klautau, who helped us complete the project. We also thank Dr. Muhammad Nawaz for providing us with the appropriate material regarding the assignment. And finally, we would like to thank our friends who helped us complete this assignment.
Fahad Habib
Baber Amin
Majid Bhatti
Bilal Rasool
Shahbaz Sarwar
User interface technology has followed an evolutionary path from punch cards to command lines to today's graphical user interface (GUI). While the GUI is easier to use than ever, research indicates that the majority of customers want yet greater flexibility in their tools. Users will increasingly determine how they work best with the computer, based on the demands of the particular task they perform as well as on their own preferences. The key to increasing productivity is to make the computer more flexible by expanding its interface capabilities, then giving people the freedom to pick the interface that they are most comfortable with.
As computers have become more pervasive in many parts of society it has become clear that most people have great difficulty understanding and communicating with computers. Often users are unable to simply state what they want done and must learn archaic commands or non-intuitive procedures in order to get anything done.
Furthermore, such communication is often accomplished via slow, difficult-to-use devices such as mice or keyboards. It is becoming clear that an easier, faster, and more intuitive method of communicating with computers is needed. One proposed method is the combination of speech recognition and natural language processing software. Speech recognition (SR) software detects human speech in an audio signal and parses that speech in order to generate a string of words, sounds, or phonemes that represents what the person said.
Natural language processing (NLP) software has the ability to process the output from speech recognition software and understand what the user meant. The NLP software could then translate what it believes to be the user's command into an actual machine command and execute it.
An effective speech recognition system includes the following:
Voice recognition:
It enables a system to recognize the speaker rather than simply responding to words being spoken.
Speech synthesis:
It enables the computer to speak.
When researchers began using computers to recognize speech, they employed a method called template matching. Template matching is a technology whereby the word or phrase to be recognized by the computer is dictated and stored in the computer's database. Thus an 'acoustical' image of the word or phrase is attained. In order for the computer to recognize a particular word or phrase among many, it needs to compare the pattern of the uttered word with all the stored patterns in the database.
Template matching functions well as long as the number of words to be recognized is limited, and the user speaks in a consistent manner and interjects pauses between words. Discrete, or isolated, speaker dependent recognition systems generally recognize up to 1,000 words and are useful in very specific busy-hands, busy-eyes environments, e.g. in luggage handling systems and quality control departments.
The functional limitations inherent to template matching systems make this method of speech recognition unacceptable for most mainstream applications.
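The comparison step of template matching can be sketched in a few lines. This is a minimal illustration, not a description of any particular product: the feature values in TEMPLATES are made up, each 'acoustical image' is assumed to have already been reduced to a short list of numbers of equal length, and plain Euclidean distance stands in for whatever similarity measure a real system would use.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(utterance, templates):
    """Return the stored word whose template is closest to the utterance."""
    return min(templates, key=lambda word: distance(utterance, templates[word]))

# Hypothetical 'acoustical images': invented energy values, one per time slice.
TEMPLATES = {
    "start": [0.1, 0.8, 0.9, 0.3],
    "stop": [0.2, 0.7, 0.2, 0.1],
    "dial": [0.6, 0.5, 0.4, 0.6],
}

# A slightly different rendition of one of the stored words.
spoken = [0.15, 0.75, 0.85, 0.35]
print(recognize(spoken, TEMPLATES))
```

Because every utterance is compared against every stored pattern, the work grows with the vocabulary, and the fixed-length comparison only holds up when the speaker says each word in much the same way every time. Both points match the limitations described above.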
Research and development have come a long way since template matching technology. Large vocabulary speech-recognition systems, those with 20,000 words or more, now use phoneme recognition. Phonemes are the smallest acoustical components of a language; there are about 40 phonemes in the English language. Note that phonemes are not the same as the letters of the alphabet: the letter 'A', for example, is pronounced differently in the words 'car' and 'make'. Phoneme recognition is the key to success in speech recognition technology, but it is not the whole story.
Speech recognition and natural language processing systems are fairly complex pieces of software. There are a variety of algorithms used in the implementation of such systems. Speech recognition works by disassembling sound into atomic units and then piecing them back together. Natural language processing attempts to translate words into ideas by examining context, patterns, phrases, etc.
Speech recognition works by breaking down the sounds the hardware "hears" into smaller, indivisible sounds called phonemes. Phonemes are distinct, atomic units of sound. For example, the word "those" is made up of three phonemes: the first is the "th" sound, the second the long "o" sound, and the third the "z" sound (spelled here with the letter "s"). A series of phonemes makes up syllables, syllables make up words, and words make up sentences, which in turn represent ideas and commands. Generally, a phoneme can be thought of as the sound made by one or more letters in sequence with other letters. When the speech recognition software has broken sounds into phonemes and syllables, a "best guess" algorithm is used to map the phonemes and syllables to actual words.
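One simple way to picture a "best guess" mapping from phonemes to words is a nearest-match lookup in a pronunciation lexicon. The sketch below is only an assumption about how such a step could look: the tiny LEXICON, the ARPAbet-style phoneme symbols, and the use of edit distance as the matching score are all chosen for illustration, and real engines rely on statistical models instead.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # drop a heard phoneme
                            curr[j - 1] + 1,      # add a missing phoneme
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

# Hypothetical lexicon: word -> phoneme sequence (ARPAbet-style symbols).
LEXICON = {
    "those": ["DH", "OW", "Z"],
    "she": ["SH", "IY"],
    "show": ["SH", "OW"],
    "doze": ["D", "OW", "Z"],
}

def best_guess(heard, lexicon):
    """Return the word whose pronunciation needs the fewest edits to match."""
    return min(lexicon, key=lambda word: edit_distance(heard, lexicon[word]))

# The final sound was misheard, yet the closest word is still found.
print(best_guess(["DH", "OW", "S"], LEXICON))
```

The lookup tolerates a single misrecognized phoneme because it ranks candidates by closeness and does not demand an exact match.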
Once the speech recognition software translates sound into words, natural language processing software takes over. Natural language processing software parses strings of words into logical units based on context, speech patterns, and further "best guess" algorithms. These logical units of speech are then analyzed and finally translated into actual commands the computer can understand, based on the same principles used to generate the logical units.
Phoneme Inventory or Vocabulary
A binary tree that comprises all the words that a system can recognize. The vocabulary is designed to contain the words of a specific user group, e.g. lawyers or radiologists.
Pronunciation Inventory, or Speaker Reference File
A database in which a specific speaker's phoneme pronunciations are stored.
Language Model
A statistical and stochastic database that assists the speech recognition engine in determining the words to be recognized. The language model is designed in parallel with the vocabulary. It is based upon texts from a specific user group and contains information on word usage and sentence structure.
When recognizing speech, whether discrete or natural, the system goes through the following steps:
· The speech signal is digitized.
· The digitized signal goes through a Fourier transformation, which computes the energy levels in the signal of a 25-millisecond frame, with a 10-millisecond overlap. The result of this process is a series of vectors that constitute the information used for the recognition process. The vectors are compared with prototypes stored in the system's memory, i.e. the speaker reference file. Given that there are variations in the user's speech (pitch, speed, etc.), a process comprising Hidden Markov Modeling (HMM) is used to compensate for variations in the duration of the phonemes. Once a prototype has been successfully identified, the result is used to start the search for a word.
· The system is assisted by a language model when searching for a word. This model comprises probabilities of one word following another. The language model contains statistical and stochastic information derived from texts covering a specific application area, e.g. legal documents. The language model dramatically enhances the system's recognition performance because it reduces the perplexity of the language. Perplexity is expressed as a number and indicates how many different words are likely to follow a particular word. The word 'Dear' in a letter can be followed by 'Sir', 'Madam', etc. You will probably find a dozen possibilities for this combination, which is a low perplexity.
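The digitizing and Fourier transformation steps in the list above can be sketched as follows. This is a rough illustration under stated assumptions: the 8,000 Hz sampling rate and the 400 Hz test tone are invented, the "10-millisecond overlap" is read literally as neighboring frames sharing 10 ms (so a new frame starts every 15 ms), and a naive discrete Fourier transform is used for clarity where a real system would use a fast one.

```python
import cmath
import math

def frames(samples, sample_rate, frame_ms=25, overlap_ms=10):
    """Split samples into frame_ms frames; neighbours share overlap_ms."""
    size = sample_rate * frame_ms // 1000
    step = size - sample_rate * overlap_ms // 1000
    return [samples[i:i + size]
            for i in range(0, len(samples) - size + 1, step)]

def energy_spectrum(frame):
    """Energy per frequency bin, via a naive discrete Fourier transform."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        spectrum.append(abs(total) ** 2)
    return spectrum

RATE = 8000  # assumed sampling rate in Hz
# 100 ms of a 400 Hz tone standing in for digitized speech.
signal = [math.sin(2 * math.pi * 400 * t / RATE) for t in range(RATE // 10)]

# One energy vector per frame: the input to the recognition process.
vectors = [energy_spectrum(f) for f in frames(signal, RATE)]
peak_bin = max(range(len(vectors[0])), key=vectors[0].__getitem__)
print(len(vectors), "frames; strongest bin:", peak_bin)
```

Each element of vectors plays the role of one of the vectors described above. In a full system these would be compared with the prototypes in the speaker reference file.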
Without a language model, the perplexity of a system with a vocabulary of 35,000 words would be equal to the vocabulary size. The language model, however, greatly reduces the perplexity because it knows the likely word combinations, without ruling out word combinations that it has not seen before.
Although most commercially available speech recognition systems use similar processes, e.g. phoneme-based recognition, HMMs, and statistical language models, there are distinct differences in speed and error rates. More importantly, there are differences in the way the user can enter speech. Some systems require the user to speak with short pauses between each word, a method called discrete speech recognition technology. In contrast, Philips natural speech recognition technology allows users to speak naturally, as if having a conversation. Systems incorporating natural speech recognition technology have the greatest potential to become accepted by users because they allow normal speech.
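The idea of a language model narrowing the search can be illustrated with word-pair (bigram) counts. The sketch below is an assumption-laden toy: the three-sentence CORPUS is invented, and "how many different words follow a given word" is used as the simple notion of perplexity described above. Unlike a real language model, this version gives unseen word pairs a probability of zero; practical systems smooth their estimates so that unseen combinations remain possible.

```python
from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count, for every word, which words follow it and how often."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            followers[prev][nxt] += 1
    return followers

def probability(followers, prev, nxt):
    """Estimated probability that nxt follows prev (0.0 if prev is unseen)."""
    total = sum(followers[prev].values())
    return followers[prev][nxt] / total if total else 0.0

# Hypothetical training text for a letter-writing application.
CORPUS = [
    "dear sir thank you for your letter",
    "dear madam thank you for your order",
    "dear sir please find the report enclosed",
]

model = train_bigrams(CORPUS)
vocabulary = {word for sentence in CORPUS for word in sentence.split()}

print("candidates without a model:", len(vocabulary))
print("candidates after 'dear':", len(model["dear"]))
print("P(sir | dear):", probability(model, "dear", "sir"))
```

Even on this tiny corpus, the set of candidates after "dear" is far smaller than the whole vocabulary, which is the effect the language model has on the word search.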
Voice command is the generic term for applications that allow you to enter or retrieve data, or have a computer or machine carry out tasks. This application area is divided into three categories:
These applications allow users to control a computer, machine or appliance by saying simple words or short phrases, like "start", "stop" and "dial". They are already in use in a wide variety of applications, such as telephone banking, quality control systems, and parcel handling systems.
These systems allow the user to say a complete sentence, out of which the computer picks the words that are important for carrying out an action or providing the user with the appropriate information.
Philips has a train timetable information system that works on this basis. People use a telephone to call the system and say, for example, "I want to take the train from Faisalabad to Okara on Sunday morning". Depending on the amount of information required, the system can then ask for more input, e.g. "Do you wish to travel First Class or Economy Class?"
The system employs voice input and voice output, either by using pre-recorded phrases or by using speech synthesis.
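Picking the important words out of a complete sentence can be sketched as a small keyword-spotting routine. This is an illustration only and does not describe how the Philips system works: the STATIONS and DAYS lists, the slot names, and the "from X" / "to Y" rules are all hypothetical.

```python
import re

# Hypothetical word lists for a timetable application.
STATIONS = {"faisalabad", "okara", "lahore"}
DAYS = {"monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"}
REQUIRED = ("origin", "destination", "day", "travel_class")

def spot(sentence):
    """Pick out only the words needed to answer a timetable query."""
    words = re.findall(r"[a-z]+", sentence.lower())
    slots = {}
    for i, word in enumerate(words):
        following = words[i + 1] if i + 1 < len(words) else None
        if word == "from" and following in STATIONS:
            slots["origin"] = following
        elif word == "to" and following in STATIONS:
            slots["destination"] = following
        elif word in DAYS:
            slots["day"] = word
    return slots

def missing(slots):
    """Slots the system still has to ask the caller about."""
    return [name for name in REQUIRED if name not in slots]

request = "I want to take the train from Faisalabad to Okara on Sunday morning"
found = spot(request)
print(found)
print("still needed:", missing(found))
```

Words such as "I", "want", and "take" are simply ignored. The slot that remains empty is what would prompt a follow-up question such as the one about travel class.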
The most advanced category of voice command applications is one that enables the computer to take whole sentences, interpret them, and perform the requested tasks. Voice command applications that allow users to employ natural speech have great potential. The ergonomics of the application determine user acceptance: applications that force the user to speak in an unnatural manner will be met with limited success.
The following picture shows a waveform of the word "She".
Description:
This type of application allows the user to verbally control equipment in situations where manual control is inconvenient, dangerous, or impossible.
Engine:
Command applications can generally be quite successful with fairly limited vocabularies. In situations where a small group of users will always be using the equipment, the increased accuracy of speaker-dependent systems may be worth the training time.
Examples:
Voice-dialed cellular phones, Voice-programmed VCRs, Computer programs that allow voice macros.
Description:
Dictation applications type what the user says. Some restrict the user to a certain "structure," or form, and are generally used for filling out standardized reports.
Engine:
Dictation applications require very large vocabularies, on the order of tens of thousands of words. Completely speaker-dependent engines are therefore unrealistic, because the time needed to train them is prohibitive, and most dictation applications employ speaker-adaptive techniques instead. Most dictation applications require discrete input.
Examples:
Commercial software like DragonDictate, IBM VoiceType, and Kurzweil Voice. Specialized report software for medical or legal reports.
Description:
Almost a subset of dictation applications, data entry applications enable the user to input information without taking their attention away from what they are doing.
Engine:
The vocabulary size depends largely upon the situation. In some applications, just knowing the digits may be all that is needed. In others, a vocabulary of thousands might be required. Only a few people generally use data entry applications, so speaker-dependent engines are often used to increase accuracy.
Examples:
Any kind of hands-busy or eyes-busy information gathering.
Description:
This use of speech recognition gives users voice control of a database or voice access to customer service, often by phone.
Engine:
Because these systems must use speaker-independent engines, they usually restrict themselves to small and easily recognized vocabularies to improve accuracy.
Examples:
Systems such as voice mail in which the user must pick among several options: call routing, banking by phone.
Speech synthesis can be defined as machine-to-person communication. It is the direct opposite of speech recognition, which is person-to-machine communication.
For text to be translated into speech there must be a way to break the text into recognizable sound units. These can be as large as sentences or words, though for an application with many utterances this can pose problems with recording and storage. Similar problems arise for syllables and half-syllables. It is possible to identify individual sound segments, which can be synthesized by the computer, making it unnecessary to store recorded voice.
The written representation of sounds is known as phonetic transcription and the simplest units are known as phonemes. There are approximately 40 phonemes in the English language. Each of these phonemes can be further transcribed into several allophones (a narrower transcription unit) depending on the sounds made on either side of the sound in question. However, a narrower transcription makes the difference between intelligible speech and intelligent speech.
Speech is the natural method of communication between people, so there are impressive benefits to be gained if people could communicate with man-made artifacts using the medium of speech. Speech synthesis is especially helpful for the disabled. Someone who is mute can communicate by telephone by typing what they want to say into the computer and having the computer speak to the party on the other end of the line. Someone who is blind can have on-screen text read aloud by the computer. The benefits for the disabled are priceless and limitless.
There are two fundamental aspects to speech synthesis. The first is concerned with determining the stored units (such as phonemes, words, sentences, etc.) and the various methods of producing realistic-sounding (i.e., smooth) speech. The second is concerned with the hardware required, the synthesizer. The synthesizer accepts the parameters produced by the synthesis logic, converts these to analog signals, and then produces the required sound.
The main objective of digital coding is to reduce the number of bits per second needed to store the speech signal while still reproducing an acceptable sound. However, there is a trade-off between the two: the fewer bits per second used, the poorer the quality of the sound.
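The relationship between bit rate and quality can be made concrete with a small experiment. The sketch below is an illustration under assumptions not found in the text above: uncompressed coding where the bit rate is simply the sampling rate times the bits per sample, an 8,000 Hz sampling rate, a 440 Hz test tone in place of speech, and uniform rounding as the only coding step.

```python
import math

def bit_rate(sample_rate, bits_per_sample):
    """Bits per second needed to store uncompressed samples."""
    return sample_rate * bits_per_sample

def quantize(x, bits):
    """Round x in [-1, 1] to the nearest of 2**bits evenly spaced levels."""
    step = 2.0 / (2 ** bits - 1)
    return round((x + 1.0) / step) * step - 1.0

def mean_squared_error(samples, bits):
    """Average squared difference between the original and coded samples."""
    return sum((s - quantize(s, bits)) ** 2 for s in samples) / len(samples)

RATE = 8000  # assumed sampling rate in Hz
# One second of a 440 Hz tone standing in for a speech signal.
tone = [math.sin(2 * math.pi * 440 * t / RATE) for t in range(RATE)]

for bits in (16, 8, 4):
    print(bits, "bits/sample:", bit_rate(RATE, bits), "bits/s,",
          "error", mean_squared_error(tone, bits))
```

Running it shows the error growing each time the number of bits per sample is cut, which is the loss of quality that practical speech coders try to keep acceptable while lowering the bit rate.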