Data augmentation and transfer learning for cross-lingual Named Entity Recognition in the biomedical domain
Brayan Stiven Lancheros, Gloria Corpas Pastor, Ruslan Mitkov
Language Resources and Evaluation, published 2024-05-10
DOI: 10.1007/s10579-024-09738-8 (https://doi.org/10.1007/s10579-024-09738-8)
Abstract
The growing volume of biomedical data and the continued expansion of the internet have sharply increased the need for Information Extraction (IE) techniques. Named Entity Recognition (NER) is one such IE task, useful to professionals in many areas. Biomedical NER is needed in several settings, for instance the extraction and analysis of biomedical literature, relation extraction, the organisation of biomedical documents, and knowledge-base completion. However, the computational treatment of entities in the biomedical domain faces a number of challenges, including the high cost of annotation, ambiguity, and the lack of biomedical NER datasets in languages other than English. These difficulties have hampered data development, affecting both the domain itself and its multilingual coverage. The purpose of this study is to overcome the scarcity of biomedical NER data in Spanish, for which only two datasets exist, by developing a robust bilingual NER model. Inspired by back-translation, this paper leverages progress in Neural Machine Translation (NMT) to create a synthetic Spanish version of the Colorado Richly Annotated Full-Text (CRAFT) dataset. Additionally, an augmented CRAFT dataset is constructed by replacing 20% of the entities in the original dataset. We evaluate two training methods, concatenation of datasets and continuous training, to assess the transfer learning capabilities of transformers using the newly obtained datasets. The best-performing NER system achieved an F1 score of 86.39% on the development set. The novel methodology proposed in this paper presents the first bilingual NER system and has the potential to improve applications across under-resourced languages.
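The abstract does not describe how the 20% entity replacement is carried out, so the following is only a minimal sketch of one plausible reading: entity mentions in a BIO-tagged sentence are swapped for other mentions of the same type drawn from a lexicon. The function name `replace_entities`, the `lexicon` argument, the BIO tagging scheme, and the `GENE` and `DISEASE` labels are all assumptions made for illustration and are not taken from the paper or from CRAFT.

```python
import random


def replace_entities(tokens, tags, lexicon, fraction=0.2, seed=0):
    """Swap about `fraction` of the entity mentions for same-type lexicon entries.

    Hypothetical helper: `tokens` and `tags` are parallel lists using BIO tags,
    and `lexicon` maps an entity type to a list of replacement strings.
    """
    rng = random.Random(seed)

    # Collect entity spans as (start, end, type) from the BIO tags.
    spans = []
    i = 0
    while i < len(tags):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            j = i + 1
            while j < len(tags) and tags[j] == "I-" + etype:
                j += 1
            spans.append((i, j, etype))
            i = j
        else:
            i += 1

    # Choose which spans to replace (rounded to the nearest whole span).
    k = round(len(spans) * fraction)
    chosen = set(rng.sample(range(len(spans)), k))

    new_tokens, new_tags = [], []
    pos = 0
    for idx, (start, end, etype) in enumerate(spans):
        # Copy the non-entity text that precedes this span unchanged.
        new_tokens.extend(tokens[pos:start])
        new_tags.extend(tags[pos:start])
        if idx in chosen and lexicon.get(etype):
            # Insert the replacement mention and re-derive its BIO tags.
            repl = rng.choice(lexicon[etype]).split()
            new_tokens.extend(repl)
            new_tags.extend(["B-" + etype] + ["I-" + etype] * (len(repl) - 1))
        else:
            new_tokens.extend(tokens[start:end])
            new_tags.extend(tags[start:end])
        pos = end
    new_tokens.extend(tokens[pos:])
    new_tags.extend(tags[pos:])
    return new_tokens, new_tags


# Illustrative sentence and lexicon (invented for this sketch).
tokens = ["The", "BRCA1", "gene", "regulates", "breast", "cancer", "cells"]
tags = ["O", "B-GENE", "O", "O", "B-DISEASE", "I-DISEASE", "O"]
lexicon = {"GENE": ["TP53"], "DISEASE": ["lung carcinoma"]}

# fraction=1.0 replaces every mention, which makes the effect easy to see.
aug_tokens, aug_tags = replace_entities(tokens, tags, lexicon, fraction=1.0)
```

Applied over a whole corpus with `fraction=0.2`, a routine of this kind would yield an augmented copy in which roughly one in five mentions differs from the original while the label sequence stays consistent. Replacements of a different token length change the sentence length, which is why the tags are rebuilt for each inserted mention.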
About the journal:
Language Resources and Evaluation is the first publication devoted to the acquisition, creation, annotation, and use of language resources, together with methods for evaluation of resources, technologies, and applications.
Language resources include language data and descriptions in machine readable form used to assist and augment language processing applications, such as written or spoken corpora and lexica, multimodal resources, grammars, terminology or domain specific databases and dictionaries, ontologies, multimedia databases, etc., as well as basic software tools for their acquisition, preparation, annotation, management, customization, and use.
Evaluation of language resources concerns assessing the state of the art for a given technology, comparing different approaches to a given problem, assessing the availability of resources and technologies for a given application, benchmarking, and assessing system usability and user satisfaction.