Attributive Collocations in the Gold Standard of Russian Collocability and Their Representation in Dictionaries and Corpora
M. Khokhlova
Voprosy Leksikografii-Russian Journal of Lexicography, published 2021-01-01
DOI: 10.17223/22274200/21/2 (https://doi.org/10.17223/22274200/21/2)
Citations: 2
Abstract
The article discusses how collocations are represented in Russian dictionaries and how information about them can be covered in a collocation database that is currently being developed. Such a resource (a gold standard) can be useful for developing applications for teaching or learning Russian as a foreign language and for addressing other theoretical and applied problems. The aim of the study was twofold: first, to analyze how explanatory and specialized dictionaries of Russian represent collocations and to what extent their data coincide; second, to investigate how these dictionary collocations are reflected in text corpora. This makes it possible to trace the relation between manually collected data and modern corpora. For the study, the author used the disambiguated subcorpus and the main corpus of the Russian National Corpus (RNC), with 6 million and 321 million words, respectively, as well as the large Internet corpus ruTenTen, with more than 14.5 billion words. The author considered attributive phrases built on the “adjective/participle + noun” model. She analyzed 120 collocations with different dictionary indices, where the dictionary index is the number of dictionaries in which a given phrase is recorded. The following hypothesis was tested: a high collocation frequency corresponds to the item being recorded in several dictionaries. In the analysis, nonparametric analogues of analysis of variance (the Friedman and Kruskal-Wallis tests) were used to assess the statistical significance of differences in the quantitative data. The frequencies of collocations were compared across corpora of different sizes and across different dictionaries. In total, more than 15 thousand examples were processed; fewer than 0.5% of them appeared in four of the six reviewed dictionaries (five printed and one electronic).
The results show that the data are heterogeneous: the items selected for a dictionary do not match their frequency characteristics, and the word combinations thus turn out to be low-frequency. About 34% of the examples are absent from the disambiguated RNC subcorpus, and about 12% of the analyzed collocations are rare (less than 0.01 ipm) even in the ruTenTen corpus. The presence of a collocation in several dictionaries indicates a higher frequency and hence reproducibility in speech. Explanatory dictionaries and collocation dictionaries show the smallest overlap in data. The results show that the amount of data is crucial and that collocability as a phenomenon should be studied on large corpora.
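The rarity threshold above is stated in ipm (instances per million words), which makes frequencies comparable across corpora of very different sizes. The following is a minimal sketch of that normalization, not the author's actual procedure. The corpus sizes are the rounded figures quoted in the abstract, and the function names and the raw hit counts in the usage note are hypothetical, chosen only for illustration.

```python
# Corpus sizes in words, rounded as reported in the abstract.
# ruTenTen is described as "more than 14.5 billion", so this is a lower bound.
CORPUS_SIZES = {
    "RNC disambiguated": 6_000_000,
    "RNC main": 321_000_000,
    "ruTenTen": 14_500_000_000,
}

# Rarity cut-off quoted in the abstract (less than 0.01 ipm).
RARE_THRESHOLD_IPM = 0.01


def ipm(raw_count: int, corpus_size: int) -> float:
    """Normalize a raw hit count to instances per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count / corpus_size * 1_000_000


def is_rare(raw_count: int, corpus_size: int) -> bool:
    """True if the normalized frequency falls below the 0.01 ipm cut-off."""
    return ipm(raw_count, corpus_size) < RARE_THRESHOLD_IPM


if __name__ == "__main__":
    # Hypothetical raw counts for a single collocation, for illustration only.
    hypothetical_hits = {"RNC disambiguated": 0, "RNC main": 2, "ruTenTen": 100}
    for corpus, hits in hypothetical_hits.items():
        size = CORPUS_SIZES[corpus]
        print(f"{corpus}: {ipm(hits, size):.4f} ipm, rare={is_rare(hits, size)}")
```

Under these assumptions, a collocation needs roughly 145 or more hits in ruTenTen to reach 0.01 ipm, while a single hit in the 6-million-word disambiguated subcorpus already corresponds to about 0.17 ipm. This illustrates why corpus size matters so much when judging whether a dictionary collocation is attested.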
About the journal:
The mission of the Russian Journal of Lexicography is to accumulate the intellectual potential of scholars and practitioners for the purpose of discussing and solving the topical issues of theoretical and applied lexicography, and new concepts of dictionary compilation.