Authors D. A. Birin, S. Yu. Melnikov, V. A. Peresypkin, I. A. Pisarev, N. N. Copkalo
Month, Year 08, 2018
Index UDC 004.931
DOI 10.23683/2311-3103-2018-8-104-114
Abstract The capabilities of four automatic text correction tools (Yandex.Speller, Afterscan, Bing Spell Check, Texterra) for correcting noisy texts are analyzed. The distortions that arise when text is typed on a keyboard and when it is produced by recognition systems are described. Experimental data are presented on the accuracy of correcting distorted texts obtained both by typing and as the output of real OCR systems processing low-quality images and of ASR systems operating in a noisy environment. To simulate the distortions caused by recognition systems, a two-stage model of random text distortion is proposed. At the first stage, each word is distorted with a given probability: the word is replaced with a random dictionary word at Levenshtein distance 1 or 2. The replacement word is chosen uniformly at random. At the second stage, each character is distorted with a given probability: the character is deleted with probability 1/3, or a random character is inserted before it with probability 1/3, or it is replaced with a random alphabet character with probability 1/3. The replacement character is chosen uniformly at random. The distorted texts obtained in this way are corrected using Yandex.Speller and Bing Spell Check, and the percentage of correct words in the corrected text is calculated. The data are averaged over a set of texts. Correction accuracy is reported for the following parameter ranges: the word distortion probability varies from 0 to 0.9 and the character distortion probability varies from 0 to 0.5. The results show that Yandex.Speller, Bing Spell Check and Texterra correct typing distortions well. These tools are ineffective at correcting distortions caused by recognition systems.

Keywords Noisy texts; random distortions; automatic correction; post-processing.
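
The two-stage distortion model described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. All function names, the lowercase Latin `ALPHABET`, and the toy dictionary in the usage note are hypothetical. The sketch also assumes details the abstract leaves open: a word with no dictionary neighbour at distance 1 or 2 is left unchanged, character distortion is applied inside words only (word boundaries are preserved), the inserted character is drawn uniformly like the replacement character, and accuracy is computed by comparing word lists position by position.

```python
import random

# Assumption: lowercase Latin alphabet; the abstract does not specify one.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def distort_words(words, dictionary, p_word, rng):
    """Stage 1: with probability p_word, swap a word for a uniformly chosen
    dictionary word at Levenshtein distance 1 or 2."""
    out = []
    for w in words:
        if rng.random() < p_word:
            candidates = [d for d in dictionary if 1 <= levenshtein(w, d) <= 2]
            if candidates:  # assumption: keep the word if it has no neighbours
                w = rng.choice(candidates)
        out.append(w)
    return out


def distort_chars(word, p_char, rng, alphabet=ALPHABET):
    """Stage 2: with probability p_char, delete a character, insert a random
    character before it, or replace it (each outcome with probability 1/3)."""
    out = []
    for ch in word:
        if rng.random() < p_char:
            op = rng.choice(("delete", "insert", "replace"))
            if op == "delete":
                continue
            if op == "insert":
                out.append(rng.choice(alphabet))
                out.append(ch)
            else:
                # assumption: the replacement may coincide with the original
                out.append(rng.choice(alphabet))
        else:
            out.append(ch)
    return "".join(out)


def distort_text(words, dictionary, p_word, p_char, seed=0):
    """Apply stage 1, then stage 2, to a list of words."""
    rng = random.Random(seed)
    stage1 = distort_words(words, dictionary, p_word, rng)
    return [distort_chars(w, p_char, rng) for w in stage1]


def word_accuracy(reference, corrected):
    """Percentage of positions where the corrected word equals the reference.
    Assumes both word lists are aligned and of equal length."""
    if not reference:
        return 0.0
    hits = sum(r == c for r, c in zip(reference, corrected))
    return 100.0 * hits / len(reference)
```

For example, `distort_text(["cat", "dog"], ["cat", "cot", "coat", "dog", "dot", "cart"], p_word=0.5, p_char=0.1)` returns a distorted word list. That list would be passed to a corrector, and the corrector's output would be scored against the original words with `word_accuracy`. Brute-force neighbour search over the whole dictionary is used only for brevity; a real experiment with a large dictionary would likely need an index.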
