High-Level Semantic Interpretation of the Russian Static Models Structure

O. Serikov, V. A. Geneeva, A. A. Aksenova, E. S. Klyshinskiy
DOI: 10.25205/1818-7935-2023-21-1-67-82
Journal: NSU Vestnik. Series: Linguistics and Intercultural Communication
Published: 2023-05-30 (Journal Article)
Citation count: 0

Abstract

Since its inception, the Word2vec vector space has become a universal tool for both scientific and practical work. Over time, it became clear that new methods for interpreting the position of words in vector spaces were lacking: the existing methods were limited to analogy tests and clustering of the vector space. In recent years, an approach based on probing (analyzing how small changes in the model affect its output) has been actively developed. In this paper, we propose a new method for interpreting the arrangement of words in a vector space, suitable for high-level interpretation of the entire space as a whole. The method identifies the main directions that select large groups of words (about a third of all the words in the model's dictionary) and oppose them along some semantic feature, and it allows us to build a shallow hierarchy of such features. We conducted our experiments on three models trained on different corpora: the Russian National Corpus, Araneum Russicum, and a collection of scientific articles from different subject domains. For our experiments, we used only the nouns from the models' dictionaries. The article presents an expert interpretation of the resulting division down to the third level of the hierarchy. The set of selected features and their hierarchy differ from model to model, but they have much in common. We found that the identified semantic features depend on the texts comprising the training corpus, their subject domain, and their style. The resulting division of words does not always correlate with the common-sense distinctions used in ontology development. For example, one feature shared across the models is the abstract or material nature of the object. However, at the upper level of the models, words are divided into everyday vs. specialized lexis, archaic lexis, and proper names vs. common nouns. The article provides examples of words included in the derived groups.
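The abstract does not specify how the "main directions" are computed, so the following is only a minimal illustrative sketch of one way such a direction-based split and shallow hierarchy could be realized: take a matrix of word vectors, find the direction of maximal variance, and divide the vocabulary by the sign of each word's projection onto it, recursing once per subgroup. The function names and the random toy vectors are assumptions for illustration, not the authors' actual procedure.

```python
import numpy as np

def split_by_main_direction(words, vectors):
    """Split a vocabulary by the sign of its projection onto the
    first principal direction (direction of maximal variance)."""
    centered = vectors - vectors.mean(axis=0)
    # First right singular vector of the centered matrix is the
    # direction along which the word vectors vary the most.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[0]
    pos = [w for w, p in zip(words, proj) if p >= 0]
    neg = [w for w, p in zip(words, proj) if p < 0]
    return pos, neg

def shallow_hierarchy(words, vectors, depth=2):
    """Build a small binary tree of word groups by splitting
    each group along its own main direction."""
    if depth == 0 or len(words) < 2:
        return words
    pos, neg = split_by_main_direction(words, vectors)
    index = {w: i for i, w in enumerate(words)}
    return [
        shallow_hierarchy(g, vectors[[index[w] for w in g]], depth - 1)
        for g in (pos, neg)
    ]

# Toy stand-in for a real model's noun vectors.
rng = np.random.default_rng(0)
words = [f"noun_{i}" for i in range(100)]
vectors = rng.normal(size=(100, 50))
group_a, group_b = split_by_main_direction(words, vectors)
tree = shallow_hierarchy(words, vectors, depth=2)
```

In this sketch each split opposes two word groups along one axis, and recursion yields the kind of shallow feature hierarchy the abstract describes; interpreting what semantic feature each axis encodes would still require the expert analysis the paper performs.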