Authors A. V. Semenova, V. M. Kureichik
Month, Year 02, 2018
Index UDC 004.041
DOI 10.23683/2311-3103-2018-2-163-173
Abstract Artificial intelligence is currently one of the most promising areas of scientific and practical knowledge. In artificial intelligence, ontologies are used for the formal specification of knowledge. The article proposes an approach to automating ontology population from a text corpus related to a single domain. The key purpose of the paper is the development of an ensemble of classifiers for the task of domain ontology population. The main goal of building the ensemble is to increase the precision of the aggregated classifier's predictions compared with the precision of each individual baseline classifier. To achieve this goal, a new version of the ensemble of classifiers is developed, based on the support vector machine method (SVM classifier), a neural network (LSTM classifier), and methods of distributional semantics (fastText, word embeddings). The principal difference between the ensemble and known approaches is the method of representing the solution and the possibility of forming groups of classifiers. During optimization, parameters are determined both for the individual classifiers and for the entire ensemble. The ensemble of classifiers was developed in MATLAB using the Text Analytics Toolbox. The ensemble is built on the Reuters-21578 machine learning data set (news articles). To train the distributional semantics models, the GloVe vector collection for English, trained on Wikipedia 2014, was selected. Comparative testing showed the advantages of the proposed ensemble when working with high-dimensional data characterized by a large number of features. The proposed ensemble of classifiers can be used to determine the topic of a document, to extract terms from text documents, and to construct a thesaurus.
Distinctive features of the developed ensemble of classifiers are: relaxed requirements on the input data; automatic selection of domain terms; the ability to use the algorithm to construct ontologies for different areas of scientific knowledge without modification; and high classification quality in an acceptable time.
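The abstract does not state how the outputs of the individual classifiers are aggregated. As a rough illustration of the general idea of an ensemble, the sketch below uses weighted soft voting, which is an assumption made here and not necessarily the authors' method. The function names `soft_vote` and `predict_label`, the weights, and the probability values are all hypothetical and serve only as an example; they are not results from the paper.

```python
from typing import Dict, Sequence


def soft_vote(probas: Sequence[Dict[str, float]],
              weights: Sequence[float]) -> Dict[str, float]:
    """Weighted average of per-class probabilities from several classifiers.

    Assumption: each base classifier returns a dict mapping a class label
    to a probability. This is an illustrative interface, not the paper's.
    """
    if not probas:
        raise ValueError("at least one classifier output is required")
    if len(probas) != len(weights):
        raise ValueError("one weight per classifier is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Collect every label seen by any classifier.
    labels = set()
    for p in probas:
        labels.update(p.keys())
    # A label missing from one classifier's output counts as probability 0.
    return {
        label: sum(w * p.get(label, 0.0) for p, w in zip(probas, weights)) / total
        for label in labels
    }


def predict_label(probas: Sequence[Dict[str, float]],
                  weights: Sequence[float]) -> str:
    """Return the label with the highest combined probability."""
    combined = soft_vote(probas, weights)
    # Iterating over sorted labels makes tie-breaking deterministic.
    return max(sorted(combined), key=lambda label: combined[label])


# Hypothetical outputs of three base classifiers for one news article.
# The numbers are invented for illustration only.
svm_out = {"earn": 0.6, "acq": 0.3, "grain": 0.1}
lstm_out = {"earn": 0.2, "acq": 0.7, "grain": 0.1}
embedding_out = {"earn": 0.3, "acq": 0.5, "grain": 0.2}

outputs = [svm_out, lstm_out, embedding_out]
equal_weights = [1.0, 1.0, 1.0]

combined = soft_vote(outputs, equal_weights)
label = predict_label(outputs, equal_weights)
```

In this sketch the weights play the role of ensemble-level parameters: tuning them on held-out data is one simple way to optimize the ensemble as a whole in addition to the individual classifiers.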

Keywords Classification; ensemble; ontology; terms; knowledge base; domain; features; text corpus; neural network.
