GlossReader at LSCDiscovery: Train to Select a Proper Gloss in English – Discover Lexical Semantic Change in Spanish

M. Rachinskiy, N. Arefyev
{"title":"black[LSCDiscovery shared task] \n GlossReader at LSCDiscovery: Train to Select a Proper Gloss in English – Discover Lexical Semantic Change in Spanish","authors":"M. Rachinskiy, N. Arefyev","doi":"10.18653/v1/2022.lchange-1.22","DOIUrl":null,"url":null,"abstract":"The contextualized embeddings obtained from neural networks pre-trained as Language Models (LM) or Masked Language Models (MLM) are not well suitable for solving the Lexical Semantic Change Detection (LSCD) task because they are more sensitive to changes in word forms rather than word meaning, a property previously known as the word form bias or orthographic bias. Unlike many other NLP tasks, it is also not obvious how to fine-tune such models for LSCD. In order to conclude if there are any differences between senses of a particular word in two corpora, a human annotator or a system shall analyze many examples containing this word from both corpora. This makes annotation of LSCD datasets very labour-consuming. The existing LSCD datasets contain up to 100 words that are labeled according to their semantic change, which is hardly enough for fine-tuning. To solve these problems we fine-tune the XLM-R MLM as part of a gloss-based WSD system on a large WSD dataset in English. Then we employ zero-shot cross-lingual transferability of XLM-R to build the contextualized embeddings for examples in Spanish. In order to obtain the graded change score for each word, we calculate the average distance between our improved contextualized embeddings of its old and new occurrences. For the binary change detection subtask, we apply thresholding to the same scores. Our solution has shown the best results among all other participants in all subtasks except for the optional sense gain detection subtask.","PeriodicalId":120650,"journal":{"name":"Workshop on Computational Approaches to Historical Language Change","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Workshop on Computational Approaches to Historical Language Change","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2022.lchange-1.22","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

The contextualized embeddings obtained from neural networks pre-trained as Language Models (LM) or Masked Language Models (MLM) are not well suited to the Lexical Semantic Change Detection (LSCD) task because they are more sensitive to changes in word form than in word meaning, a property previously known as the word form bias or orthographic bias. Unlike many other NLP tasks, it is also not obvious how to fine-tune such models for LSCD. To conclude whether there are any differences between the senses of a particular word in two corpora, a human annotator or a system must analyze many examples containing this word from both corpora. This makes annotation of LSCD datasets very labour-intensive. The existing LSCD datasets contain up to 100 words labeled according to their semantic change, which is hardly enough for fine-tuning. To solve these problems, we fine-tune the XLM-R MLM as part of a gloss-based WSD system on a large English WSD dataset. We then exploit the zero-shot cross-lingual transferability of XLM-R to build contextualized embeddings for examples in Spanish. To obtain the graded change score for each word, we calculate the average distance between our improved contextualized embeddings of its old and new occurrences. For the binary change detection subtask, we apply thresholding to the same scores. Our solution achieved the best results among all participants in all subtasks except the optional sense gain detection subtask.
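The pipeline described in the abstract has two computational steps: extracting a contextualized embedding for each occurrence of a target word, and aggregating distances between the old-corpus and new-corpus embeddings. Below is a minimal sketch of the first step using the off-the-shelf `xlm-roberta-base` checkpoint from the HuggingFace transformers library. The paper instead uses XLM-R fine-tuned inside a gloss-based WSD system, so the model choice, the mean-pooling strategy, and the helper name `target_embedding` are illustrative assumptions, not the authors' implementation.

```python
# Sketch: contextualized embedding for one occurrence of a target word.
# NOTE: uses the plain pre-trained checkpoint as a stand-in for the
# WSD-fine-tuned XLM-R encoder described in the paper.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def target_embedding(sentence: str, target: str) -> torch.Tensor:
    """Mean-pool the last-layer states of the subtokens covering `target`."""
    enc = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]       # (seq_len, 2) char spans
    start = sentence.index(target)               # first occurrence only
    end = start + len(target)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
    # Keep subtokens whose character span overlaps the target word;
    # the (e > s) check drops special tokens, which carry (0, 0) offsets.
    mask = [(s < end) and (e > start) and (e > s) for s, e in offsets.tolist()]
    return hidden[torch.tensor(mask)].mean(dim=0)

emb = target_embedding("El banco estaba cerrado los domingos.", "banco")
```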
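And a sketch of the scoring step, assuming occurrence embeddings for the old and new corpora have already been collected (e.g., with `target_embedding` above). Reading "average distance" as the mean cosine distance over all cross-corpus pairs is an assumption (a distance between averaged embeddings would be another reading), and the threshold for the binary subtask is illustrative, not the value tuned by the authors.

```python
# Sketch: graded and binary change scores from occurrence embeddings.
import numpy as np
from scipy.spatial.distance import cdist

def graded_change(old: np.ndarray, new: np.ndarray) -> float:
    """Mean pairwise cosine distance between old- and new-corpus embeddings.
    `old` has shape (n_old, dim), `new` has shape (n_new, dim)."""
    return float(cdist(old, new, metric="cosine").mean())

def binary_change(old: np.ndarray, new: np.ndarray,
                  threshold: float = 0.5) -> bool:
    """Flag a word as changed when its graded score crosses a threshold
    (0.5 is a placeholder; the paper tunes this on the task data)."""
    return graded_change(old, new) >= threshold

# usage: score = graded_change(np.stack(old_embs), np.stack(new_embs))
```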