Assessment of the E3C corpus for the recognition of disorders in clinical texts

IF 1.9 · Zone 3 (Computer Science) · JCR Q3 · COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering · Pub Date: 2023-07-18 · DOI: 10.1017/s1351324923000335

Roberto Zanoli, A. Lavelli, Daniel Verdi do Amarante, Daniele Toti


Citation count: 0

Abstract

Disorder named entity recognition (DNER) is a fundamental task in biomedical natural language processing and has attracted considerable attention. The task consists of extracting named entities of disorders, such as diseases, symptoms, and pathological functions, from unstructured text. The European Clinical Case Corpus (E3C) is a freely available multilingual corpus (English, French, Italian, Spanish, and Basque) of semantically annotated clinical case texts. Entities of type disorder in the clinical cases are annotated at both the mention level and the concept level. At the mention level, the annotation identifies the entity text spans, for example, abdominal pain. At the concept level, the entity text spans are associated with their concept identifiers in the Unified Medical Language System (UMLS), for example, C0000737. The corpus can therefore serve as a benchmark for training and assessing information extraction systems. In the present work, multiple experiments were conducted to test whether the mention-level annotation of the E3C corpus is appropriate for training DNER models. In these experiments, traditional machine learning models such as conditional random fields and more recent multilingual pre-trained models based on deep learning were compared with standard baselines. The multilingual pre-trained models were fine-tuned (i) on each language of the corpus to test per-language performance, (ii) on all languages to test multilingual learning, and (iii) on all languages except the target language to test cross-lingual transfer learning. The results show that the E3C corpus is appropriate for training a system capable of mining disorder entities from clinical case texts. Researchers can use these results as baselines for this corpus against which to compare their own models. The implemented models have been made available through the European Language Grid platform for quick and easy access.
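As a rough illustration of two ideas in the abstract, the sketch below (i) turns mention-level character spans into per-token labels and (ii) selects the training languages for the three fine-tuning settings. It is not the authors' code. The BIO tagging scheme, the whitespace tokenizer, the `DISORDER` label name, the two-letter language codes, the setting names, and the example sentence are all assumptions made for illustration; only the example mention "abdominal pain", its UMLS identifier C0000737, and the five languages come from the abstract.

```python
import re
from typing import List, Tuple

# The five E3C languages named in the abstract; the ISO 639-1 codes are our choice.
LANGUAGES = ["en", "fr", "it", "es", "eu"]

Token = Tuple[str, int, int]  # (text, start offset, end offset exclusive)


def tokenize(text: str) -> List[Token]:
    """Whitespace tokenizer with character offsets (a simplifying assumption)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def spans_to_bio(tokens: List[Token], mentions: List[Tuple[int, int]]) -> List[str]:
    """Map mention-level character spans to BIO tags, one tag per token.

    A token is tagged only if it lies fully inside a mention span.
    BIO is assumed here; the abstract does not name a tagging scheme.
    """
    tags = ["O"] * len(tokens)
    for m_start, m_end in mentions:
        inside = False
        for i, (_, t_start, t_end) in enumerate(tokens):
            if t_start >= m_start and t_end <= m_end:
                tags[i] = "I-DISORDER" if inside else "B-DISORDER"
                inside = True
    return tags


def training_languages(target: str, setting: str) -> List[str]:
    """Return the languages used for fine-tuning under each setting.

    Setting names are hypothetical labels for settings (i), (ii), (iii).
    """
    if target not in LANGUAGES:
        raise ValueError(f"unknown language: {target}")
    if setting == "per_language":      # (i) train and test on the same language
        return [target]
    if setting == "multilingual":      # (ii) train on all languages
        return list(LANGUAGES)
    if setting == "cross_lingual":     # (iii) train on all but the target language
        return [lang for lang in LANGUAGES if lang != target]
    raise ValueError(f"unknown setting: {setting}")


if __name__ == "__main__":
    # Hypothetical sentence; "abdominal pain" spans characters 16-30.
    text = "Patient reports abdominal pain ."
    mention = {"span": (16, 30), "cui": "C0000737"}  # concept-level link to UMLS
    tokens = tokenize(text)
    for (word, _, _), tag in zip(tokens, spans_to_bio(tokens, [mention["span"]])):
        print(word, tag)
    print(training_languages("eu", "cross_lingual"))
```

In the cross-lingual setting the target language contributes no training data at all, so any performance on it has to come from what the multilingual model transfers from the other four languages.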




Source journal

Natural Language Engineering (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore: 5.90

Self-citation rate: 12.00%

Review time: >12 weeks

About the journal: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics, from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multimodal interfaces, it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.