Collection and evaluation of lexical complexity data for Russian language using crowdsourcing

Impact Factor: 1.5 · Language & Linguistics
A. Abramov, Vladimir Ivanov
{"title":"Collection and evaluation of lexical complexity data for Russian language using crowdsourcing","authors":"A. Abramov, Vladimir Ivanov","doi":"10.22363/2687-0088-30118","DOIUrl":null,"url":null,"abstract":"Estimating word complexity with binary or continuous scores is a challenging task that has been studied for several domains and natural languages. Commonly this task is referred to as Complex Word Identification (CWI) or Lexical Complexity Prediction (LCP). Correct evaluation of word complexity can be an important step in many Lexical Simplification pipelines. Earlier works have usually presented methodologies of lexical complexity estimation with several restrictions: hand-crafted features correlated with word complexity, performed feature engineering to describe target words with features such as number of hypernyms, count of consonants, Named Entity tag, and evaluations with carefully selected target audiences. Modern works investigated the use of transforner-based models that afford extracting features from surrounding context as well. However, the majority of papers have been devoted to pipelines for the English language and few translated them to other languages such as German, French, and Spanish. In this paper we present a dataset of lexical complexity in context based on the Russian Synodal Bible collected using a crowdsourcing platform. We describe a methodology for collecting the data using a 5-point Likert scale for annotation, present descriptive statistics and compare results with analogous work for the English language. We evaluate a linear regression model as a baseline for predicting word complexity on handcrafted features, fastText and ELMo embeddings of target words. The result is a corpus consisting of 931 distinct words that used in 3,364 different contexts.","PeriodicalId":53426,"journal":{"name":"Russian Journal of Linguistics","volume":"1 1","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2022-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Russian Journal of Linguistics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.22363/2687-0088-30118","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"0","JCRName":"LANGUAGE & LINGUISTICS","Score":null,"Total":0}
Citations: 0

Abstract

Estimating word complexity with binary or continuous scores is a challenging task that has been studied for several domains and natural languages. This task is commonly referred to as Complex Word Identification (CWI) or Lexical Complexity Prediction (LCP). Correct evaluation of word complexity can be an important step in many Lexical Simplification pipelines. Earlier works have usually presented methodologies for lexical complexity estimation with several restrictions: they relied on hand-crafted features correlated with word complexity, performed feature engineering to describe target words with features such as the number of hypernyms, the count of consonants, or a Named Entity tag, and ran evaluations with carefully selected target audiences. More recent works have investigated transformer-based models, which also allow extracting features from the surrounding context. However, the majority of papers have been devoted to pipelines for the English language, and few have adapted them to other languages such as German, French, and Spanish. In this paper we present a dataset of lexical complexity in context based on the Russian Synodal Bible, collected using a crowdsourcing platform. We describe a methodology for collecting the data using a 5-point Likert scale for annotation, present descriptive statistics, and compare the results with analogous work for the English language. We evaluate a linear regression model as a baseline for predicting word complexity from handcrafted features and from fastText and ELMo embeddings of target words. The result is a corpus of 931 distinct words used in 3,364 different contexts.
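The abstract mentions a linear-regression baseline over handcrafted features and fastText/ELMo embeddings of target words, with complexity labels derived from 5-point Likert annotations. Below is a minimal illustrative sketch of such a baseline using only fastText word vectors. The file name `lexical_complexity_ru.csv`, the column names, and the normalization of the Likert mean into [0, 1] are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of a linear-regression LCP baseline on fastText embeddings.
# Assumes a CSV with hypothetical columns "word" and "mean_rating" (mean of
# 5-point Likert annotations) and a pre-trained Russian fastText model.
import numpy as np
import pandas as pd
import fasttext
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from scipy.stats import pearsonr

# Load annotations: one row per (word, context) pair.
df = pd.read_csv("lexical_complexity_ru.csv")  # hypothetical file name

# Map the 5-point Likert mean onto [0, 1], a common convention in LCP work.
df["complexity"] = (df["mean_rating"] - 1) / 4

# Represent each target word with its fastText embedding.
ft = fasttext.load_model("cc.ru.300.bin")  # pre-trained Russian vectors
X = np.vstack([ft.get_word_vector(w) for w in df["word"]])
y = df["complexity"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Linear regression baseline on word embeddings alone.
reg = LinearRegression().fit(X_train, y_train)
pred = reg.predict(X_test)

print("MAE:", mean_absolute_error(y_test, pred))
print("Pearson r:", pearsonr(y_test, pred)[0])
```

In the same spirit, handcrafted features (e.g. word length or frequency) or contextual ELMo vectors could be concatenated to `X`; the evaluation metrics shown (MAE, Pearson correlation) are standard choices for continuous complexity prediction.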
Source journal: Russian Journal of Linguistics (Arts and Humanities: Language and Linguistics)
CiteScore: 3.00
Self-citation rate: 33.30%
Articles per year: 43
Review time: 14 weeks