Multilingual Clinical NER: Translation or Cross-lingual Transfer?

Clinical Natural Language Processing Workshop. Pub Date: 2023-06-07. DOI: 10.48550/arXiv.2306.04384

X. Fontaine, Félix Gaschi, Parisa Rastin, Y. Toussaint


Citations: 1

Abstract

Natural language tasks such as Named Entity Recognition (NER) in the clinical domain can be very time-consuming and expensive on non-English texts because annotated data is scarce. Cross-lingual transfer (CLT) circumvents this issue: a multilingual large language model is fine-tuned on a specific task in one language and can then provide high accuracy for the same task in another language. However, other methods leveraging translation models can also perform NER without annotated data in the target language, by translating either the training set or the test set. This paper compares cross-lingual transfer with these two alternative methods for clinical NER in French and in German without any training data in those languages. To this end, we release MedNERF, a medical NER test set extracted from French drug prescriptions and annotated with the same guidelines as an English dataset. Through extensive experiments on this dataset and on a German medical dataset (Frei and Kramer, 2021), we show that translation-based methods can achieve performance similar to CLT but require more care in their design. And while they can take advantage of monolingual clinical language models, those do not guarantee better results than large general-purpose multilingual models, whether with cross-lingual transfer or translation.
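The three strategies compared in the abstract can be outlined schematically. The sketch below is only an illustration of how the pipelines differ in what gets translated and which model gets trained; it is not the authors' implementation. Every name in it (`cross_lingual_transfer`, `translate_train`, `translate_test`, the `Trainer` and `Alignment` types) is hypothetical, the trainers and translators are passed in as plain callables rather than real models, and the word-alignment-based label projection is an assumption about how labels might be carried across a translation.

```python
from typing import Callable, List, Tuple

Tokens = List[str]
Labels = List[str]
Example = Tuple[Tokens, Labels]
Tagger = Callable[[Tokens], Labels]
Trainer = Callable[[List[Example]], Tagger]           # hypothetical: trains a tagger
Alignment = List[Tuple[int, int]]                     # (target_index, english_index)


def cross_lingual_transfer(train_multilingual: Trainer,
                           english_train: List[Example]) -> Tagger:
    # Fine-tune a multilingual model on English data only; the resulting
    # tagger is applied unchanged to target-language text.
    return train_multilingual(english_train)


def translate_train(train_target: Trainer,
                    english_train: List[Example],
                    translate_example: Callable[[Example], Example]) -> Tagger:
    # Translate the English training set (tokens and projected labels) into
    # the target language, then train a model on the translated data.
    translated = [translate_example(example) for example in english_train]
    return train_target(translated)


def translate_test(train_english: Trainer,
                   english_train: List[Example],
                   to_english: Callable[[Tokens], Tuple[Tokens, Alignment]]) -> Tagger:
    # Train on English, translate each test sentence into English at
    # inference time, and project predicted labels back through the alignment.
    english_tagger = train_english(english_train)

    def tag(tokens: Tokens) -> Labels:
        english_tokens, alignment = to_english(tokens)
        english_labels = english_tagger(english_tokens)
        labels = ["O"] * len(tokens)                  # unaligned tokens stay "O"
        for target_index, english_index in alignment:
            labels[target_index] = english_labels[english_index]
        return labels

    return tag
```

In this framing, both translation-based pipelines depend on an extra step (translating labeled examples, or aligning and projecting labels back) that cross-lingual transfer does not need, which is one plausible reading of why the abstract says they require more care in their design.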

