Article

Article title ON THE EFFICIENCY OF THE NOISY TEXT CORRECTION SOFTWARE DEPENDING ON THE DISTORTION TYPE
Authors D. A. Birin, S. Yu. Melnikov, V. A. Peresypkin, I. A. Pisarev, N. N. Copkalo
Section SECTION III. MATHEMATICAL AND SOFTWARE
Month, Year 08, 2018
Index UDC 004.931
DOI 10.23683/2311-3103-2018-8-104-114
Abstract The capabilities of four automatic text correction tools (Yandex.Speller, Afterscan, Bing Spell Check, Texterra) for correcting noisy texts are analyzed. The text distortions that arise during keyboard typing and in the output of recognition systems are described. Experimental data are presented on the correction accuracy for distorted texts obtained both by typing and as the output of real OCR systems processing low-quality images and of ASR systems operating in a noisy environment. To simulate the distortions caused by recognition systems, a two-stage model of random text distortion is proposed. At the first stage (word distortion), each word is distorted with a given probability: a distorted word is replaced with a random dictionary word at Levenshtein distance 1 or 2 from it. The replacement word is chosen according to the uniform distribution. At the second stage (character distortion), each character is distorted with a given probability: a distorted character is deleted with probability 1/3, or a random character is inserted before it with probability 1/3, or it is replaced with a random character of the alphabet with probability 1/3. The replacement character is chosen according to the uniform distribution. The distorted texts obtained in this way are corrected using Yandex.Speller and Bing Spell Check, and the percentage of correct words in the corrected text is calculated. The data are averaged over a set of texts. Correction accuracy estimates are reported for the following parameter range: the word distortion probability varies from 0 to 0.9 and the character distortion probability varies from 0 to 0.5. The results show that Yandex.Speller, Bing Spell Check and Texterra correct typing distortions well. These tools are ineffective for correcting distortions caused by recognition systems.
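The two-stage distortion model and the word-level accuracy measure described in the abstract can be sketched as follows. This is only an illustrative Python sketch, not the authors' implementation: the function names, the Latin alphabet, the choice to leave word boundaries undistorted, and the position-by-position comparison of words are all assumptions made here for the example.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # assumed alphabet, for illustration only


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def distort_words(words, dictionary, p_word, rng):
    """Stage 1: with probability p_word, replace a word with a uniformly
    chosen dictionary word at Levenshtein distance 1 or 2 from it."""
    out = []
    for w in words:
        if rng.random() < p_word:
            candidates = [d for d in dictionary if 1 <= levenshtein(w, d) <= 2]
            if candidates:  # assumption: keep the word if no candidate exists
                w = rng.choice(candidates)
        out.append(w)
    return out


def distort_chars(word, p_char, rng, alphabet=ALPHABET):
    """Stage 2: with probability p_char, a character is deleted, or gets a
    random character inserted before it, or is replaced (1/3 each)."""
    out = []
    for ch in word:
        if rng.random() < p_char:
            op = rng.choice(("delete", "insert", "replace"))
            if op == "delete":
                continue
            if op == "insert":
                out.append(rng.choice(alphabet))
                out.append(ch)
            else:
                out.append(rng.choice(alphabet))
        else:
            out.append(ch)
    return "".join(out)


def distort_text(words, dictionary, p_word, p_char, rng):
    """Apply both stages; spaces between words are left intact (assumption)."""
    stage1 = distort_words(words, dictionary, p_word, rng)
    return [distort_chars(w, p_char, rng) for w in stage1]


def word_accuracy(reference, corrected):
    """Percentage of reference words reproduced at the same position
    (assumed alignment; the abstract does not specify one)."""
    if not reference:
        return 0.0
    hits = sum(r == c for r, c in zip(reference, corrected))
    return 100.0 * hits / len(reference)
```

In an experiment of the kind described, `distort_text` would be run over a set of texts for each pair of probabilities, the output passed to a correction tool, and `word_accuracy` averaged over the texts. The call to the correction tool itself is deliberately left out, since its interface is not described here.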


Keywords Noisy texts; random distortions; automatic correction; post-processing.
References 1. Birin D.A., Mel'nikov S.Yu., Peresypkin V.A. Ob effektivnosti sredstv korrektsii iskazhennykh tekstov dlya rezul'tatov raboty sistem raspoznavaniya [About efficiency of means of correction of the distorted texts for results of work of systems of recognition], Superkomp'yuternye tekhnologii (SKT-2018): Materialy 5-y Vserossiyskoy nauchno-tekhnicheskoy konferentsii [Supercomputer technologies (SKT-2018): Materials of the 5th all-Russian scientific and technical conference]: in 2 vol. Vol. 1. Rostov-on-Don; Taganrog: Izd-vo YuFU, 2018, pp. 71-75.
2. Subramaniam L.V. et al. A survey of types of text noise and techniques to handle noisy text, Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data, July 23-24, 2009, Barcelona, Spain. DOI: 10.1145/1568296.1568315.
3. Bassil Y., Alwani M. Post Editing Error Correction Algorithm for Speech Recognition using Bing Spelling Suggestion, International Journal of Advanced Computer Science and Applications, 2012, Vol. 3, No. 2, pp. 95-101.
4. Feld M., Momtazi S., Freigang F., Klakow D., Müller C. Mobile texting: can post-ASR correction solve the issues? An experimental study on gain vs. costs, Proceedings of the 2012 ACM international conference on Intelligent User Interfaces, February 14-17, 2012, pp. 37-40. Lisbon, Portugal. DOI: 10.1145/2166966.2166974.
5. Evershed J., Fitch K. Correcting Noisy OCR: Context beats Confusion, DATeCH 2014, May 19-20, 2014, Madrid, Spain. DOI: 10.1145/2595188.2595200.
6. Lopresti D.P. Optical character recognition errors and their effects on natural language processing, International Journal on Document Analysis and Recognition (IJDAR), September 2009, Vol. 12, Issue 3, pp. 141-151. DOI: 10.1007/s10032-009-0094-8.
7. Packer T.L., Lutes J.F., Stewart A.P., Embley D.W., Ringger E.K., Seppi K.D., et al. Extracting person names from diverse and noisy OCR text, Proceedings of the fourth workshop on Analytics for noisy unstructured text data AND '10, 2010, pp. 19-26. DOI: 10.1145/1871840.1871845.
8. Kumar A., Lehal G.S. Automatic Text Correction for Devanagari OCR, Indian Journal of Science and Technology, December 2016, Vol. 9 (45). DOI: 10.17485/ijst/2016/v9i45/106372.
9. Gadde P., Goutam R., Shah R., Bayyarapu H.S., Subramaniam L.V. Experiments with artificially generated noise for cleansing noisy text, Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data, MOCR AND ’11, pp. 4:1-4:8. ACM, 2011.
10. Dey L., Haque S.K.M. Studying the effects of noisy text on text mining applications, Proceedings of The Third Workshop on Analytics for Noisy Unstructured Text Data AND’09. Barcelona, Spain, 2009, pp. 107-114.
11. Clark E., Araki K. Text Normalization in Social Media: Progress, Problems and Applications for a Pre-Processing System of Casual English, Procedia - Social and Behavioral Sciences 27, December 2011, pp. 2-11. DOI: 10.1016/j.sbspro.2011.10.577.
12. Saloot M.A., Idris N., Mahmud R. An architecture for Malay Tweet normalization, Inf. Process. Manag., 2014, Vol. 50, No. 5, pp. 621-633, DOI: 10.1016/j.ipm.2014.04.009.
13. Wang A., Kan M.-Y., Andrade D., Onishi T., Ishikawa K. Chinese Informal Word Normalization: an Experimental Study, International Joint Conference on Natural Language Processing, 2013, pp. 127-135. DOI: 10.1007/978-3-319-68612-7_25.
14. Tursun O., Cakici R. Noisy Uyghur Text Normalization, Proceedings of the 3rd Workshop on Noisy User-generated Text, Copenhagen, Denmark, September 7, 2017, pp. 85-93. DOI: 10.18653/v1/w17-4412.
15. Ikeda T., Shindo H., Matsumoto Y. Japanese Text Normalization with Encoder-Decoder Model, Proceedings of the 2nd Workshop on Noisy User-generated Text, Osaka, Japan, December 11, 2016, pp. 118-126.
16. Bassil Y., Alwani M. OCR post-processing error correction algorithm using Google’s online spelling suggestion, Journal of Emerging Trends in Computing and Information Sciences, January 2012, Vol. 3, No. 1.
17. Speller – Yandex Technologies. Available at: https://tech.yandex.ru/speller/ (accessed 08 November 2018).
18. AfterScan – post-OCR text proofing, advanced spell-checking, automatic correction. Available at: http://www.afterscan.com/ru/ (accessed 08 November 2018).
19. Turdakov D. i dr. Texterra: infrastruktura dlya analiza tekstov [Texterra: Infrastructure for text analysis], Trudy Instituta sistemnogo programmirovaniya RAN [Proceedings of Institute for system programming of Russian Academy of Sciences], 2014, Vol. 26, Issue 1, pp. 421-438. DOI: 10.15514/ISPRAS-2014-26(1)-18.
20. Microsoft Cognitive Services – Bing Spell Check API. Available at: https://azure.microsoft.com/ru-ru/services/cognitive-services/spell-check/ (accessed 08 November 2018).
21. Meshcheryakov R.V. Struktura sistem sinteza i raspoznavaniya rechi [Structure of speech synthesis and recognition systems], Izvestiya Tomskogo politekhn. un-ta [News of Tomsk Polytechnic University], 2009, Vol. 315, No. 5, pp. 127-132.
22. Smirnov S.V. Korrektirovka oshibok opticheskogo raspoznavaniya na osnove reytingo-rangovoy modeli teksta [Correction of optical recognition errors based on the rating-rank model of the text], Trudy SPIIRAN [SPIIRAS Proceedings], 2014, Issue 4, No. 35, pp. 64-82. DOI: 10.15622/sp.35.5.
23. Rudakov I.V., Romanov A.S. Raspoznavanie tekstovogo izobrazheniya s uchetom morfologii slova [Recognition of a text image taking into account the morphology of the word], Nauka i obrazovanie: nauchnoe izdanie MGTU im. N.E. Baumana [Science and education: scientific publication of MSTU. N.E. Bauman], 2012, Issue 4, pp. 1-6.
24. Farra N., Tomeh N., Rozovskaya A., Habash N. Generalized Character-Level Spelling Error Correction, ACL (2), 2014, pp. 161-167.
25. Belozerov A.A., Vakhlakov D.V., Mel'nikov S.Yu., Peresypkin V.A., Sidorov E.S. Tekhnologicheskie aspekty postroeniya sistemy sbora i predobrabotki korpusov novostnykh tekstov dlya sozdaniya modeley yazyka [Technological aspects of creation of system of gathering and preprocessing of the corpora of news texts to create language models], Izvestiya YuFU. Tekhnicheskie nauki [Izvestiya SFedU. Engineering Sciences], 2016, No. 12 (185), pp. 29-42. DOI: 10.18522/2311-3103-2016-12-2942.
