Article

Article title ENSEMBLE OF CLASSIFIERS FOR ONTOLOGY ENRICHMENT
Authors A. V. Semenova, V. M. Kureichik
Section SECTION III. MODELING AND ARTIFICIAL INTELLIGENCE
Month, Year 02, 2018
Index UDC 004.041
DOI 10.23683/2311-3103-2018-2-163-173
Abstract Artificial intelligence is currently one of the most promising areas of scientific and practical knowledge. In artificial intelligence, ontologies are used for the formal specification of knowledge. The article proposes an approach to automating the process of ontology enrichment from a corpus of texts belonging to the same domain. The key purpose of the paper is the development of an ensemble of classifiers for the task of domain ontology population. The main goal of building the ensemble is to increase the prediction precision of the aggregated classifier compared with the precision of an individual baseline classifier. To achieve this goal, a new ensemble of classifiers is developed, based on the support vector machine method (SVM classifier), a neural network (LSTM classifier), and methods of distributional semantics (fastText, word embeddings). The principal difference between this ensemble and known approaches lies in the method of representing the solution and the possibility of forming groups of classifiers. During optimization, parameters are determined both for the individual classifiers and for the ensemble as a whole. The ensemble of classifiers was implemented in MATLAB using the Text Analytics Toolbox and built on the Reuters-21578 machine learning dataset (news articles). To train the distributional semantics models, the English-language GloVe vector collection pretrained on the 2014 Wikipedia dump was selected. Comparative testing showed the advantages of the proposed ensemble of classifiers when working with high-dimensional data characterized by a large number of features. The proposed ensemble can be used to determine the topic of a document, to extract terms from text documents, and to construct a thesaurus. Distinctive features of the developed ensemble of classifiers are: mild requirements on the input data; automatic selection of domain terms; the possibility of applying the algorithm to construct ontologies of different areas of scientific knowledge without modification; and high classification quality in acceptable time.
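
For illustration only, the following minimal MATLAB sketch shows how an SVM + LSTM soft-voting ensemble over pretrained word embeddings of the kind described above could be assembled with the Text Analytics Toolbox. It is not the authors' implementation: the toy corpus, the bundled fastText embedding (used here in place of the GloVe Wikipedia 2014 vectors), and the fixed ensemble weight w are assumptions made for the example.

% Minimal sketch (not the authors' implementation): SVM + LSTM soft-voting
% ensemble over pretrained word embeddings. Requires the Text Analytics,
% Statistics and Machine Learning, and Deep Learning Toolboxes, plus the
% fastText word embedding support package.

emb = fastTextWordEmbedding;                 % pretrained English embedding

% Toy labelled corpus standing in for Reuters-21578 categories (assumed).
docsTrain = tokenizedDocument([
    "crude oil prices rose after the opec meeting"
    "opec agreed to cut crude output next quarter"
    "the central bank raised interest rates to curb inflation"
    "the dollar fell against the yen in money markets"
    "wheat and corn exports fell due to drought"
    "grain shipments from gulf ports increased this week"]);
labelsTrain = categorical(["energy"; "energy"; "money"; "money"; "grain"; "grain"]);
numClasses  = numel(categories(labelsTrain));

% --- Classifier 1: multiclass SVM (ECOC) on averaged word vectors --------
XTrain = zeros(numel(docsTrain), emb.Dimension);
for i = 1:numel(docsTrain)
    vecs = word2vec(emb, string(docsTrain(i)));
    vecs(any(isnan(vecs), 2), :) = [];       % drop out-of-vocabulary words
    XTrain(i, :) = mean(vecs, 1);
end
svmModel = fitcecoc(XTrain, labelsTrain);

% --- Classifier 2: LSTM on word-embedding sequences ----------------------
seqTrain = doc2sequence(emb, docsTrain);
layers = [
    sequenceInputLayer(emb.Dimension)
    lstmLayer(64, 'OutputMode', 'last')
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];
opts = trainingOptions('adam', 'MaxEpochs', 50, 'Verbose', false);
lstmModel = trainNetwork(seqTrain, labelsTrain, layers, opts);

% --- Ensemble: weighted soft voting over per-class scores ----------------
% Both models list classes in the alphabetical order of labelsTrain.
docTest = tokenizedDocument("opec ministers discussed crude oil production cuts");
vecsT = word2vec(emb, string(docTest));
vecsT(any(isnan(vecsT), 2), :) = [];
[~, negLoss] = predict(svmModel, mean(vecsT, 1));
svmScores  = exp(negLoss) ./ sum(exp(negLoss));   % normalise to pseudo-probabilities
lstmScores = predict(lstmModel, doc2sequence(emb, docTest));
w = 0.5;                                          % ensemble weight (assumed)
scores = w * svmScores + (1 - w) * double(lstmScores);
[~, idx] = max(scores);
classNames = categories(labelsTrain);
fprintf('Predicted topic: %s\n', classNames{idx});

In the setting described in the abstract, the per-classifier parameters and the ensemble weighting would be tuned during the optimization step rather than fixed by hand as in this toy example.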

Keywords Classification; ensemble; ontology; terms; knowledge base; domain; features; text corpus; neural network.
References 1. Nayhanova L.V. Tekhnologiya sozdaniya metodov avtomaticheskogo postroeniya ontologiy s primeneniem geneticheskogo i avtomatnogo programmirovaniya: monografiya [Technology for creating methods of automatic ontology construction using genetic and automata-based programming: monograph]. Ulan-Ude: Izd-vo BNTs SO RAN, 2008, 244 p.
2. Bubareva O.A. Matematicheskaya model protsessa integratsii informatsionnyikh sistem na osnove ontologiy [Mathematical model of the process of integration of information systems based on ontologies], Sovremennye problemy nauki i obrazovaniya [Modern problems of science and education], 2012, No. 2. Available at: www.science-education.ru/102-6030.
3. Semenova A.V., Kureichik V.M. Combined Method for Integration of Heterogeneous Ontology Models for Big Data Processing and Analysis, Proceedings of the 6th Computer Science On-line Conference 2017 (CSOC2017), Vol. 1, pp. 302-311.
4. Parkhomenko P.A., Grigorev A.A., Astrakhantsev N.A. Obzor i eksperimental'noe sravnenie metodov klasterizatsii tekstov [Review and experimental comparison of text clustering methods], Trudyi ISP RAN [Proceedings of ISP RAS], 2017, Vol. 29, Issue 2, pp. 161-200. DOI: 10.15514/ISPRAS-2017-29(2)-6.
5. Andrews Nicholas O., Fox Edward A. Recent developments in document clustering: Technical report, Computer Science, Virginia Tech, 2007.
6. Aggarwal Charu C, Zhai Cheng Xiang. Mining text data. Springer Science & Business Media, 2012.
7. Whissell John S., Clarke Charles L.A. Improving document clustering using Okapi BM25 feature weighting, Information retrieval, 2011, Vol. 14, No. 5, pp. 513-523.
8. Huang Anna. Similarity measures for text document clustering, Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, 2008, pp. 49-56.
9. Sathiyakumari K., Manimekalai G., Preamsudha V. A survey on various approaches in document clustering, International Journal of Computer Technology and Applications, 2011, Vol. 2 (5), pp. 1534-1539.
10. Aggarwal Charu C, Zhai Cheng Xiang. Mining text data. Springer Science & Business Media, 2012.
11. Marchionini Gary. Exploratory search: from finding to understanding, Communications of the ACM, 2006, Vol. 49, No. 4, pp. 41-46.
12. Vyugin V.V. Matematicheskie osnovy mashinnogo obucheniya i prognozirovaniya [Mathematical foundations of machine learning and forecasting]. Moscow: MTsNMO, 2014, 304 p.
13. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space, In Proceedings of Workshop at ICLR, 2013. Available at: https://arxiv.org/abs/1301.3781.
14. Pennington J., Socher R., Manning Ch. D. GloVe: Global Vectors for Word Representation. Available at: http://www.aclweb.org/anthology/D14-1162.
15. Bojanowski P., Grave E., Joulin A., Mikolov T. Enriching Word Vectors with Subword Information. Available at: https://arxiv.org/abs/1607.04606.
16. Choi F., Wiemer-Hasting P., Moore J. Latent semantic Analysis for Text Segmentation, Proceedings of NAACL'01, Pittsburgh, PA, 2001, pp. 109-117.
17. Gama J. Knowledge Discovery from Data Streams. Singapore, CRC Press Pubh, 2010. DOI: 10.1201/EBK1439826119.
18. Tomin N., Zhukov A., Sidorov D., Kurbatsky V., Panasetsky D., Spiryaev V. Random Forest Based Model for Preventing Large-Scale Emergencies in Power Systems, International Journal of Artificial Intelligence, 2015, Vol. 13, no. 1, pp. 221-228.
19. KiberLeninka. Available at: https://cyberleninka.ru/article/n/modifikatsiya-algoritma-sluchaynogo-lesa-dlya-klassifikatsii-nestatsionarnyh-potokovyh-dannyh.
20. Dyakonov V., Kruglov V. Matematicheskie pakety rasshireniya MATLAB. Spetsialnyy spravochnik [Mathematical expansion packs MATLAB. Special reference]. Saint Petersburg: Piter, 2001, 480 p.
21. Reuters-21578 Text Categorization Test Collection. Available at: http://www.daviddlewis.com/resources/testcollections/reuters21578.
22. Ageev M., Kuralenok I., Nekrestyanov I. Ofitsialnyie metriki ROMIP 2006 [Official metrics of ROMIP 2006]. Available at: http://romip.ru/romip2006/appendix_a_metrics.pdf.
