Multilingual Controllable Transformer-Based Lexical Simplification

Proces. del Leng. Natural | Pub Date: 2023-07-05 | DOI: 10.48550/arXiv.2307.02120

Sheang Cheng Kim, Horacio Saggion



Abstract

Text is by far the most ubiquitous source of knowledge and information and should be made easily accessible to as many people as possible; however, texts often contain complex words that hinder reading comprehension and accessibility. Suggesting simpler alternatives for complex words without compromising meaning would therefore help convey the information to a broader audience. This paper proposes mTLS, a multilingual controllable Transformer-based Lexical Simplification (LS) system obtained by fine-tuning the T5 model. The novelty of this work lies in the use of language-specific prefixes, control tokens, and candidates extracted from pre-trained masked language models to learn simpler alternatives for complex words. Evaluation results on three well-known LS datasets (LexMTurk, BenchLS, and NNSEval) show that our model outperforms previous state-of-the-art models such as LSBert and ConLS. Further evaluation on part of the recent TSAR-2022 multilingual LS shared-task dataset shows that our model performs competitively with the participating systems for English LS and even outperforms the GPT-3 model on several metrics. Our model also obtains performance gains for Spanish and Portuguese.
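The abstract names three input ingredients (a language-specific prefix, control tokens, and candidates from a masked language model) but does not give their exact format. The sketch below is only an illustration of how such an input string could be assembled for a text-to-text model. Every name in it is an assumption made for this example, not the authors' actual format: the function `build_input`, the `ControlTokens` fields, the `<WL_…>`/`<WR_…>` token spellings, the `[T]…[/T]` markers, and the `candidates:` suffix.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ControlTokens:
    """Hypothetical control values; the field names are illustrative only."""
    word_length_ratio: float  # assumed: target length relative to the complex word
    word_rank_ratio: float    # assumed: target frequency rank relative to the complex word


def build_input(sentence: str, complex_word: str, lang: str,
                controls: ControlTokens, candidates: List[str]) -> str:
    """Assemble one source string: prefix, control tokens, marked sentence, candidates."""
    if complex_word not in sentence:
        raise ValueError("complex word must occur in the sentence")

    # Language-specific prefix (the spelling "simplify <lang>:" is an assumption).
    prefix = f"simplify {lang}:"

    # Control tokens rendered as pseudo-tokens with two decimal places (assumed format).
    control_str = (f"<WL_{controls.word_length_ratio:.2f}> "
                   f"<WR_{controls.word_rank_ratio:.2f}>")

    # Mark the first occurrence of the complex word so the model knows the target.
    marked = sentence.replace(complex_word, f"[T]{complex_word}[/T]", 1)

    # Candidates would come from a masked language model; here they are passed in.
    candidate_str = "candidates: " + ", ".join(candidates)

    return " ".join([prefix, control_str, marked, candidate_str])


example = build_input(
    sentence="The cat perched on the mat.",
    complex_word="perched",
    lang="en",
    controls=ControlTokens(word_length_ratio=0.5, word_rank_ratio=0.75),
    candidates=["sat", "rested"],
)
print(example)
```

In a setup like this, changing only `lang` and the candidate list would let one model serve several languages, which is one plausible reading of "multilingual controllable" in the abstract.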
