Authors A.A. Belozerov, S.Yu. Melnikov, V.A. Peresypkin, E.S. Sidorov, D.V. Vakhlakov
Month, Year 12, 2016
Index UDC 004.931
DOI 10.18522/2311-3103-2016-12-2942
Abstract The paper describes an approach to building a software system for collecting and preprocessing text corpora for natural language modelling. The system assumes that the list of sources is prepared by language experts, which increases collection speed and improves the quality of the resulting corpus. The corpora are collected from various Internet sources (mainly news web portals) by crawling and parsing RSS feeds, sitemap files and data from social networks. An example of how to collect such sources for the Arabic language is given in the paper. The software system consists of several logical modules: a link collection module, a crawler, an HTML parsing and text extraction module, and a web interface for two types of users, language expert and administrator. Original text extraction approaches based on a "text quantity metric" and an additional preprocessing step are also discussed. The preprocessing step applies fuzzy duplicate search algorithms to remove repeated pieces of text and a filtering algorithm to discard articles that do not belong to the target language. The software system is implemented in Python using several open source frameworks. It runs under Ubuntu on two dedicated servers with 16 CPU cores in total. In August 2016 the system was processing more than 20,000 news sources in 14 languages from 70 countries. The whole list of sources is crawled in two hours. Text corpora ranging in size from 500 MB to 20 GB were collected for all these languages. The described technology makes it possible to collect text corpora classified by country of origin, date of writing, topic and type of source, and to enrich existing corpora in order to build more precise natural language models. As an experiment, the collected data was used to build trigram models for the English language (political topic), which were compared in terms of perplexity with similar models built on the well-known OANC and Europarl_v7 corpora.
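The fuzzy duplicate search in the preprocessing step can be illustrated with a minimal Python sketch. The abstract does not say which algorithm the system uses, so this sketch assumes a common baseline: word shingles compared by Jaccard similarity. The function names (`shingles`, `jaccard`, `drop_near_duplicates`), the shingle length `k=3` and the threshold `0.8` are all hypothetical choices made for illustration and do not come from the paper.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased, punctuation ignored)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Texts shorter than k words are represented by a single short shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(articles, threshold=0.8, k=3):
    """Keep the first article of every group of near-duplicates.

    Assumption: an article is a fuzzy duplicate if its shingle set has
    Jaccard similarity >= threshold with an article that was already kept.
    """
    kept = []
    kept_shingles = []
    for text in articles:
        s = shingles(text, k)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # near-duplicate of an earlier article, skip it
        kept.append(text)
        kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    sample = [
        "The president signed the new budget law on Monday.",
        "The president signed the new budget law on Monday!",
        "Local team wins the regional football cup final.",
    ]
    for article in drop_near_duplicates(sample):
        print(article)
```

The pairwise comparison is quadratic in the number of articles, which is acceptable for a sketch; at the scale of tens of thousands of sources one would more likely use an approximate method such as MinHash signatures.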

Keywords Text corpora; Parsing; Corpus Quality; Perplexity; Language model.
