The simplest error correction interface forces the user to respeak the whole utterance repeatedly until the recognizer gets it right. Such an interface may be easy to design and build, but it meets with very low user acceptance, because a greater investment of the user's time does not make it any more likely that the error will be corrected.
Another interface design forces the user to edit the recognized text with a keyboard- or mouse-based editor. Though this method may guarantee correction of the errors, it requires the user to switch input modalities to accomplish a single task, and it also eliminates many of the hands-free, eyes-free benefits of a speech interface.
A better recognition repair interface is one which allows the user to repair misrecognitions by voice alone, in a way that is natural and effective in human-to-human communication. One of the most common human-to-human repair methods is to respeak the misrecognized portion of an utterance, speaking more clearly and often hyper-articulating the word or word sequence that was misrecognized. If some word or words are particularly confusing, humans will often use spelling as a method of disambiguation.
In this paper, we describe methods used to implement a speech interface for repairing misrecognitions by simply respeaking or spelling a misrecognized section of an utterance. While much speech "repair" work has focused on repairs within a single spontaneous utterance [1], we are concerned with the repair of errorful recognizer hypotheses.
Three methods are described here: (1) the automatic subpiece location method, (2) the spoken hypothesis repair method, and (3) the spelling hypothesis repair method.
Figure 1 shows the standard repair paradigm used. The speaker first utters a primary utterance, which is recognized in the primary recognition. If an error occurs, the speaker respeaks or spells the erroneous subsection of the primary utterance. This secondary utterance (or repair utterance) is recognized in the secondary recognition, using a language model constructed separately for the specific repair situation. The results of both steps are then used to locate and/or repair the original error.
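This paradigm can be summarized in the following sketch; the function names and data structures below are illustrative placeholders, not parts of our implementation.

def repair_error(primary_nbest, repair_utterance, recognize, build_repair_lm,
                 locate_and_repair):
    """One pass of the repair paradigm.

    primary_nbest:     n-best list from the primary recognition
    repair_utterance:  the respoken or spelled subsection
    recognize:         recognizer, called with an utterance and a language model
    build_repair_lm:   builds the language model for this specific repair situation
    locate_and_repair: combines both results to locate and/or repair the error
    """
    repair_lm = build_repair_lm(primary_nbest)                # repair-specific LM
    secondary_nbest = recognize(repair_utterance, repair_lm)  # secondary recognition
    return locate_and_repair(primary_nbest, secondary_nbest)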
Given that the user will respeak some unknown subsection of the primary utterance, a language model is created which will allow all substrings of the first hypothesis of the primary recognition. The secondary utterance (a respeaking of a subpiece of the primary utterance) is then run through the recognizer, using this newly constructed language model. This will produce the secondary n-best list of possible choices for the respoken subpiece. Each hypothesis in the secondary n-best list (from best to worst) is evaluated to determine if it is a substring of the first hypothesis of the primary recognition. If it is a substring, the evaluation stops, and the endpoints of this substring are returned as the location of the respoken subpiece.
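As an illustration, the location step can be sketched as follows, assuming the primary hypothesis and each secondary hypothesis are represented as simple word lists (the representation and the function name are ours, not the recognizer's):

def locate_subpiece(primary_hyp, secondary_nbest):
    """Return (start, end) word indices of the respoken subpiece within the
    primary hypothesis, or None if no secondary hypothesis is a substring.

    primary_hyp:     list of words, e.g. ["show", "the", "ships", "in", "port"]
    secondary_nbest: list of word lists, best hypothesis first
    """
    for hyp in secondary_nbest:                        # evaluate best to worst
        n = len(hyp)
        if n == 0:
            continue
        # scan the primary hypothesis in normal reading order
        for start in range(len(primary_hyp) - n + 1):
            if primary_hyp[start:start + n] == hyp:
                return start, start + n                # endpoints of the subpiece
    return None                                        # no exact substring found

Because the primary hypothesis is scanned left to right, the first of any identical matches is returned; this is the selection rule discussed below.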
There is some possibility that no subpiece is found, since a wordpair language model is not constrained enough to guarantee that every secondary hypothesis is an exact substring of the primary hypothesis. A finite state grammar constraining the search could guarantee that only exact substrings are produced. In this set of experiments, however, wordpair models were found to be sufficient, always producing some exact subpiece within the first five secondary recognition hypotheses.
There is also the problem that there may be multiple, identical subpieces in the primary recognition first hypothesis. In this case, recognizing exactly what sequence of words was respoken is not enough to determine which of the identical sequences in the utterance was intended. This problem would be most prevalent in strings of numbers or letters, where repetition is common. For the current experiments with the Resource Management task, the first matching subpiece (scanned in normal reading order) in the primary recognition hypothesis was used. Though other selection criteria could be used, this simple method was found to work well for this mostly non-repetitive RM task.
Preliminary testing of this method showed that, as might be expected, it works poorly if the subpiece to be located is only one or two short words. Though this could be a significant drawback in particular situations, it is rarely seen in actual usage, since humans tend to respeak a few words around the error to make it easier for other humans to locate the exact position in the utterance where the misrecognition occurred.
The idea is relatively simple, but requires the assumption that the correct subpiece is in the n-best list (or word lattice) somewhere. A language model is created which restricts each hypothesis of the secondary recognition to be one of the alternative substrings in the primary n-best list at the same location as the highlighted errorful substring. The secondary utterance is then recognized using this new language model. In the simplest form, the top hypothesis from this secondary recognition can be used to replace the errorful subsection of the first hypothesis of the primary recognition.
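In this simplest form, the splice itself can be sketched as follows (the names are illustrative):

def replace_subpiece(primary_hyp, span, secondary_nbest):
    """Splice the top repair hypothesis into the primary hypothesis.

    primary_hyp:     list of words (first hypothesis of the primary recognition)
    span:            (start, end) word indices of the highlighted errorful subpiece
    secondary_nbest: n-best list from the repair recognition, best hypothesis first
    """
    start, end = span
    return primary_hyp[:start] + secondary_nbest[0] + primary_hyp[end:]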
In this experiment, the language model used was a simple bigram model (no unseen wordpair probability) based only on the counts found in the appropriate subpieces of the n-best list.
To find all the possible subpieces in the n-best list which were alternatives for the highlighted section of the best hypothesis, the start and end frames of the highlighted section were determined. In all other n-best hypotheses, the subpiece was chosen to include any words between or overlapping these start and end frames. Only unique substrings were used to determine the counts for the bigram language model. The original subpiece (known to contain at least one error) is also excluded from the language model data so that it cannot reoccur.
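A sketch of this subpiece extraction and language model estimation follows, assuming each n-best hypothesis carries per-word start and end frames; the representation and the sentence-boundary markers are our illustrative choices.

from collections import defaultdict

def alternative_subpieces(nbest_with_frames, hl_start, hl_end, bad_subpiece):
    """Collect the unique alternative subpieces overlapping the highlighted frames.

    nbest_with_frames: list of hypotheses, each a list of (word, start_frame, end_frame)
    hl_start, hl_end:  start and end frames of the highlighted (errorful) section
    bad_subpiece:      tuple of words known to contain an error; excluded from the data
    """
    subpieces = set()
    for hyp in nbest_with_frames:
        # keep words lying between or overlapping the highlighted frames
        words = tuple(w for (w, s, e) in hyp if s < hl_end and e > hl_start)
        if words and words != bad_subpiece:
            subpieces.add(words)                      # unique substrings only
    return subpieces

def bigram_counts(subpieces):
    """Raw bigram counts over the unique subpieces; the resulting model assigns
    no probability to unseen wordpairs."""
    counts = defaultdict(int)
    for words in subpieces:
        seq = ("<s>",) + words + ("</s>",)            # sentence-boundary markers
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts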
Merely replacing the errorful subsection with the top hypothesis from the secondary recognition means that all of the subpiece order information from the primary n-best list is unused. To make use of this information, we use a method which rescores and reorders the secondary recognition list by averaging the scores from the secondary recognition list with scores of identical subpieces in the primary recognition list.
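A sketch of this rescoring is given below, assuming each list provides a comparable per-subpiece score; the score representation and the handling of subpieces that appear in only one list are our simplifications.

def rescore_with_primary(secondary_nbest, primary_subpiece_scores):
    """Average each secondary hypothesis score with the score of the identical
    subpiece in the primary n-best list, then re-sort.

    secondary_nbest:         list of (words_tuple, score) pairs, higher is better
    primary_subpiece_scores: dict mapping words_tuple -> score from the primary list
    """
    rescored = []
    for words, sec_score in secondary_nbest:
        if words in primary_subpiece_scores:
            score = (sec_score + primary_subpiece_scores[words]) / 2.0
        else:
            score = sec_score                 # subpiece absent from the primary list
        rescored.append((words, score))
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored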
Another method tried is to let the spelling recognizer do a free recognition (no language model), and then score each possible subpiece by its DTW distance from the recognized letter sequence. This gives a score for each subpiece, allowing scores from the spelling recognition and the primary n-best list to be combined to select the best replacement subpiece.
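A sketch of this scoring, using a symbol-level DTW with a simple 0/1 local cost; the actual local distance and the way the spelling score is combined with the primary n-best score are not specified here.

def dtw_distance(seq_a, seq_b):
    """Dynamic-time-warping distance between two symbol sequences (0/1 local cost)."""
    inf = float("inf")
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    d = [[inf] * cols for _ in range(rows)]
    d[0][0] = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            d[i][j] = cost + min(d[i - 1][j],        # insertion
                                 d[i][j - 1],        # deletion
                                 d[i - 1][j - 1])    # match or substitution
    return d[rows - 1][cols - 1]

def score_subpieces_by_spelling(subpieces, recognized_letters):
    """Rank candidate subpieces by the DTW distance between their spelling and
    the letter sequence from the free spelling recognition (smaller is better)."""
    scored = [(words, dtw_distance(list("".join(words)), recognized_letters))
              for words in subpieces]
    scored.sort(key=lambda item: item[1])
    return scored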
In experiment 1, the original ARPA speakers' utterances were used as the primary utterance, and, in those cases where recognition errors occurred, a separate speaker recorded both the respoken and spelled repair utterances.
In experiment 2, the same speaker spoke all 390 primary utterances as well as the respoken repair utterances for those primary utterances that were misrecognized.
For these experiments, the CSR recognizer was run in a sub-optimal mode, in order to generate more errorful tokens over our test database.
Table 2 shows the success rates for the various repair methods in both experiments. The column labeled "Highlight" reports the results when the errorful section was highlighted exactly by hand. The other column gives the results when the errorful section was highlighted automatically with the subpiece location method described in section 2.1.
Table 3 shows the improvements in overall sentence accuracy when using the separate and combined repair mechanisms.
Our results indicate that repeating or spelling a misrecognized subsection of an utterance can be an effective way of repairing more than two-thirds of recognition errors. These techniques alone do not guarantee the ability to correct a misrecognition every time, but when used as important components of a total speech interface, these and similar improvements should lead to greater user acceptance of speech interfaces in practical applications.
[2] O. Schmidbauer and J. Tebelskis. An LVQ based Reference Model for Speaker-Adaptive Speech Recognition. In Proceedings of the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 441-444.
[3] H. Hild. Speaker-Independent Connected Letter Recognition with a Multi-State Time Delay Neural Network. In Proceedings of the 3rd European Conference on Speech Communication and Technology (EUROSPEECH), 1993.