Collocation ranking: frequency vs semantics

Nikola Ljubesic, N. Logar, Iztok Kosem
Journal: Slovenščina 2.0: empirical, applied and interdisciplinary research
DOI: 10.4312/slo2.0.2021.2.41-70 (https://doi.org/10.4312/slo2.0.2021.2.41-70)
Publication date: 2021-12-29
Publication type: Journal Article
Citations: 3

Abstract

Collocations play an important role in language description, especially in identifying word meanings. In modern lexicography, lists of collocates ranked by a statistical measure are an indispensable part of deducing meaning. In this paper, we compare two approaches to ranking collocates: (a) the logDice method, which is frequency-based and the dominant one in use, and (b) the fastText word embeddings method, which is new and semantics-based. The comparison was made on two Slovene datasets: one of general-language headwords and their collocates, and one of headwords and their collocates extracted from a corpus of language for special purposes. The evaluation had two parts. For the quantitative part, we used supervised machine learning with a support-vector machine (SVM) algorithm, scored by the area under the ROC curve (AUC); for the qualitative part, lexicographers evaluated the rankings produced by the two methods. The results were somewhat inconsistent: while the quantitative evaluation confirmed that the machine-learning-based approach ranked collocates better than the frequency-based one, lexicographers in most cases judged the collocate lists of the two methods to be very similar.
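
To make the contrast between the two kinds of ranking concrete, here is a minimal sketch in Python. The logDice score uses its standard definition, 14 + log2(2·f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency of headword and collocate and f_x, f_y are their individual frequencies. The embedding-based score is shown simply as cosine similarity between word vectors, which is an assumption made for illustration; the abstract does not say how the fastText vectors were turned into a ranking. All function names, frequencies, and vectors below are hypothetical toy values, not data from the paper.

```python
import math


def log_dice(f_xy: float, f_x: float, f_y: float) -> float:
    """Standard logDice: 14 + log2(2 * f_xy / (f_x + f_y))."""
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(scores: dict[str, float]) -> list[str]:
    """Return collocates ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy counts: headword frequency, then for each collocate
# (co-occurrence frequency with the headword, collocate frequency).
HEADWORD_FREQ = 1000
COUNTS = {"A": (100, 500), "B": (40, 50), "C": (200, 100_000)}

# Hypothetical toy vectors standing in for word embeddings (assumption:
# semantic ranking is approximated by cosine similarity to the headword).
HEADWORD_VEC = [1.0, 0.0]
VECTORS = {"A": [0.0, 1.0], "B": [1.0, 1.0], "C": [1.0, 0.0]}

logdice_scores = {
    w: log_dice(f_xy, HEADWORD_FREQ, f_y) for w, (f_xy, f_y) in COUNTS.items()
}
cosine_scores = {w: cosine(HEADWORD_VEC, v) for w, v in VECTORS.items()}

logdice_ranking = rank(logdice_scores)
cosine_ranking = rank(cosine_scores)
```

With these toy values the two rankings disagree: the frequency-based score penalises the very frequent collocate C, while the vector-based score places it first. This only illustrates why the two approaches can order the same collocates differently; it says nothing about the results reported in the paper.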