|Article title||METHOD FOR MEASURING THE SEMANTIC-SIMILARITY OF TEXTUAL DOCUMENTS|
|Authors||Bermudez Soto José Gregorio|
|Section||SECTION I. INFORMATION-MEASURING SYSTEM|
|Month, Year||03, 2017|
|Abstract||The paper considers a method for comparing Russian-language text documents in natural language processing in order to determine their semantic proximity. It addresses the subtask of measuring semantic similarity according to the criteria of correctness and depth. Following a review of existing approaches to text comparison, a method is proposed for determining the semantic similarity between two texts on the basis of textual passages. The method not only establishes whether documents written in natural language are semantically close, but also quantifies their similarity. The study belongs to the field of automatic text processing (ATP) and the formalization of natural languages, a field that has gradually moved from the simplest methods of analysis to more complex ones and has reached a level of processing at which a text is treated not just as a sequence of words but as a meaningful whole, in line with human perception. Within the general scheme of automatic text processing, the study focuses on the semantic level and gives a detailed description of the final stage of that scheme, the comparison of closeness. The method is based on determining the degree of similarity between passages, where a passage is a separate part of a text that has some kind of integrity. Text segmentation serves as the basis for comparison: the subtask of extracting meaningful parts of the text, called "passages", is considered, and Russian-language texts are then compared in the subtask of determining semantic proximity. A review of existing comparison methods is given, and a method for determining the degree of similarity between textual passages within a semantic class is proposed.
The proposed method is compared with existing methods and with comparisons made by people in an experiment; the results show the suitability of the proposed method.|
|Keywords||Measurement of textual proximity; definition of similarity; comparison of texts; presentation of semantic schemes; passages.|
|References||1. Yazykoznanie. Bol. entsikl. slovar' [Linguistics. Big encyclopaedic dictionary], chief ed. V.N. Yartseva. 2nd ed. Moscow: Bol. ros. entsikl., 1998, 685 p.
2. Marchuk Yu.N. Komp'yuternaya lingvistika [Computational linguistics]. Moscow: AST; Vostok-Zapad, 2007, 317 p.
3. Gaydamakin N.A. Avtomatizirovannye informatsionnye sistemy, bazy i banki dannykh. Vvodnyy kurs: ucheb. posobie [Automated information systems, databases and data banks. Introductory course: textbook]. Moscow: Gelios ARV, 2002, 368 p.
4. Baranov A.N. Vvedenie v prikladnuyu lingvistiku: ucheb. posobie [Introduction to applied linguistics: textbook]. Moscow: Editorial URSS, 2001, 360 p.
5. Iskusstvennyy intellekt [Artificial intelligence]. In 3 books. Book 1. Sistemy obshcheniya i ekspertnye sistemy: spravochnik [Communication and expert systems: a Handbook], ed. by E.V. Popova. Moscow: Radio i svyaz', 1990, 464 p.
6. Potapova R.K. Rech': kommunikatsiya, informatsiya, kibernetika: ucheb. posobie [Speech: communication, information, cybernetics: textbook]. Moscow: Editorial URSS, 2003, 568 p.
7. Muñoz T.R. Representación del conocimiento textual mediante técnicas lógico-conceptuales en aplicaciones de tecnologías del lenguaje humano, Tesis doctoral. Universidad de Alicante. España, 2009, 128 p.
8. Maurer H., Kappe F. and Zaka B. Plagiarism – A Survey, Journal of Universal Computer Science, 2006, No. 12, pp. 1050-1084.
9. Bao J-P., Shen J-Y., Liu X-D., Liu H-Y. and Zhang X-D. Semantic Sequence Kin: A Method of Document Copy Detection, Advances In Knowledge Discovery and Data Mining. Lecture Notes in Artificial Intelligence (LNAI) – Sydney, Australia, 2004, Vol. 3056, pp. 529-538.
10. Bao J-P., Shen J-Y., Liu X-D., Liu H-Y. and Zhang X-D. Finding Plagiarism Based on Common Semantic Sequence Model, The 5th International Conference on Advances in Web-Age Information Management (WAIM). Lecture Notes in Computer Science – China: Dalian, 2004, Vol. 3129, pp. 640-645.
11. Chi-Hong L. and Yuen-Yan C. A Natural Language Processing Approach to Automatic Plagiarism Detection, The 8th ACM Conference on Information Technology Education (SIGITE’07) – Florida, USA, 2007, pp. 213-218.
12. Vishnyakov R.Yu. Razrabotka i issledovanie formalizovannykh predstavleniy i semanticheskikh skhem predlozheniy tekstov nauchno-tekhnicheskogo stilya dlya povysheniya effektivnosti informatsionnogo poiska: diss. … kand. tekhn. nauk [The development and study of formal representations and semantic diagrams of the sentences of the texts of scientific-technical style to improve the efficiency of information retrieval. Cand. of eng. sc. diss.]. Taganrog, 2012.
13. Bermudes S.Kh.G. O metode izvlecheniya znachimykh tekstovykh passazhey kak bazy dlya tekstovogo sravneniya [On the method of extraction of important text passages as a basis for text comparison], Informatizatsiya i svyaz' [Informatization and communication], 2016, No. 3, pp. 231-219.
14. Salguero L.F. Resolución abductiva de anáforas pronominales. Available at: http://www.http://personal.us.es/fsoler/papers/ivjornadas.pdf. (accessed 29 January 2016).
15. Agirre E., Cer D., Diab M., Gonzalez-Agirre A. and Weiwei Guo. A pilot on semantic textual similarity, The 6th International Workshop on Semantic Evaluation (SemEval-2012 task 6) – Atlanta, USA, 2012, pp. 385-393.
16. Agirre E., Cer D., Diab M., Gonzalez-Agirre A. and Weiwei Guo. Semantic textual similarity, 2nd Joint Conference on Lexical and Computational Semantics (*SEM-2013) – Georgia, USA, 2013, pp. 32-43.
17. Michael R. and Anette F. Automatically identifying implicit arguments to improve argument linking and coherence modeling, 2nd Joint Conference on Lexical and Computational Semantics (*SEM-2013) – Georgia, USA, 2013, pp. 321-333.
18. Salehi B. and Cook P. Predicting the compositionality of multiword expressions using translations in multiple languages, Second Joint Conference on Lexical and Computational Semantics (*Sem-2013), Atlanta, Georgia, USA. 2013, pp. 134-142.
19. Palmer A., Horbach A. and Pinkal M. Using the text to evaluate short answers for reading comprehension exercises, Second Joint Conference on Lexical and Computational Semantics (*SEM). – Vol. 2 (SemEval 2013) – Atlanta, Georgia, USA, 2013, pp. 520-524.
20. Leacock C. and Chodorow M. Combining local context and WordNet similarity for word sense identification, Christiane Fellbaum, editor, MIT Press, 1998, pp. 265-283.
21. Lesk M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone, 5th Annual International Conference on Systems Documentation, 1986, pp. 24-26. ACM.
22. Wu Zhibiao and Stone Palmer M. Verb semantics and lexical selection, James Pustejovsky, editor, ACL, 1994, pp. 133-138. Morgan Kaufmann Publishers / ACL.
23. Resnik P. Using information content to evaluate semantic similarity in a taxonomy, 14th International Joint Conference on Artificial Intelligence, IJCAI’95. San Francisco, CA, USA, 1995, pp. 448-453.
24. Lin Dekang. An information-theoretic definition of similarity, Fifteenth International Conference on Machine Learning, ICML ’98. – San Francisco, CA, USA. Morgan Kaufmann Publishers Inc., 1998, pp. 296-304.
25. Jiang Jay J. and Conrath D.W. Semantic similarity based on corpus statistics and lexical taxonomy, 10th International Conference on Research in Computational Linguistics, ROCLING’97, 1997, pp. 19-33.
26. Mihalcea R., Corley C. and Strapparava C. Corpus-based and knowledge-based measures of text semantic similarity, 21st National Conference on Artificial Intelligence, 2006, pp. 775-780.
27. Turney Peter D. Mining the web for synonyms: PMI-IR versus LSA on TOEFL, 12th European Conference on Machine Learning, 2001, pp. 491-502.
28. Landauer Thomas K., Foltz Peter W. and Laham Darrell. An Introduction to Latent Semantic Analysis. Discourse Processes. Springer-Verlag, 1998, pp. 259-284.
29. Francesc Ll. C. Algoritmos de similitud entre cadenas de texto (php). 2015. Available at: fran-cescllorens.eu/00tokenizer/dst.php.