xMEN:用于跨语言医疗实体规范化的模块化工具包。

IF 2.5 Q2 HEALTH CARE SCIENCES & SERVICES
JAMIA Open Pub Date : 2024-12-26 eCollection Date: 2025-02-01 DOI:10.1093/jamiaopen/ooae147
Florian Borchert, Ignacio Llorca, Roland Roller, Bert Arnrich, Matthieu-P Schapranow
{"title":"xMEN:用于跨语言医疗实体规范化的模块化工具包。","authors":"Florian Borchert, Ignacio Llorca, Roland Roller, Bert Arnrich, Matthieu-P Schapranow","doi":"10.1093/jamiaopen/ooae147","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>To improve performance of medical entity normalization across many languages, especially when fewer language resources are available compared to English.</p><p><strong>Materials and methods: </strong>We propose xMEN, a modular system for cross-lingual (x) medical entity normalization (MEN), accommodating both low- and high-resource scenarios. To account for the scarcity of aliases for many target languages and terminologies, we leverage multilingual aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder (CE) model if annotations for the target task are available. To balance the output of general-purpose candidate generators with subsequent trainable re-rankers, we introduce a novel rank regularization term in the loss function for training CEs. For re-ranking without gold-standard annotations, we introduce multiple new weakly labeled datasets using machine translation and projection of annotations from a high-resource language.</p><p><strong>Results: </strong>xMEN improves the state-of-the-art performance across various benchmark datasets for several European languages. Weakly supervised CEs are effective when no training data is available for the target task.</p><p><strong>Discussion: </strong>We perform an analysis of normalization errors, revealing that complex entities are still challenging to normalize. New modules and benchmark datasets can be easily integrated in the future.</p><p><strong>Conclusion: </strong>xMEN exhibits strong performance for medical entity normalization in many languages, even when no labeled data and few terminology aliases for the target language are available. To enable reproducible benchmarks in the future, we make the system available as an open-source Python toolkit. The pre-trained models and source code are available online: https://github.com/hpi-dhc/xmen.</p>","PeriodicalId":36278,"journal":{"name":"JAMIA Open","volume":"8 1","pages":"ooae147"},"PeriodicalIF":2.5000,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11671143/pdf/","citationCount":"0","resultStr":"{\"title\":\"xMEN: a modular toolkit for cross-lingual medical entity normalization.\",\"authors\":\"Florian Borchert, Ignacio Llorca, Roland Roller, Bert Arnrich, Matthieu-P Schapranow\",\"doi\":\"10.1093/jamiaopen/ooae147\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>To improve performance of medical entity normalization across many languages, especially when fewer language resources are available compared to English.</p><p><strong>Materials and methods: </strong>We propose xMEN, a modular system for cross-lingual (x) medical entity normalization (MEN), accommodating both low- and high-resource scenarios. To account for the scarcity of aliases for many target languages and terminologies, we leverage multilingual aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder (CE) model if annotations for the target task are available. To balance the output of general-purpose candidate generators with subsequent trainable re-rankers, we introduce a novel rank regularization term in the loss function for training CEs. For re-ranking without gold-standard annotations, we introduce multiple new weakly labeled datasets using machine translation and projection of annotations from a high-resource language.</p><p><strong>Results: </strong>xMEN improves the state-of-the-art performance across various benchmark datasets for several European languages. Weakly supervised CEs are effective when no training data is available for the target task.</p><p><strong>Discussion: </strong>We perform an analysis of normalization errors, revealing that complex entities are still challenging to normalize. New modules and benchmark datasets can be easily integrated in the future.</p><p><strong>Conclusion: </strong>xMEN exhibits strong performance for medical entity normalization in many languages, even when no labeled data and few terminology aliases for the target language are available. To enable reproducible benchmarks in the future, we make the system available as an open-source Python toolkit. The pre-trained models and source code are available online: https://github.com/hpi-dhc/xmen.</p>\",\"PeriodicalId\":36278,\"journal\":{\"name\":\"JAMIA Open\",\"volume\":\"8 1\",\"pages\":\"ooae147\"},\"PeriodicalIF\":2.5000,\"publicationDate\":\"2024-12-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11671143/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JAMIA Open\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/jamiaopen/ooae147\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/2/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JAMIA Open","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/jamiaopen/ooae147","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/2/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

摘要

目的:提高跨多种语言的医疗实体规范化性能,特别是在语言资源比英语少的情况下。材料和方法:我们提出了xMEN,一个跨语言(x)医疗实体规范化(MEN)的模块化系统,可适应低资源和高资源场景。为了解释许多目标语言和术语的别名的稀缺性,我们通过跨语言候选生成来利用多语言别名。对于候选排序,如果目标任务的注释可用,我们将合并一个可训练的交叉编码器(CE)模型。为了平衡通用候选生成器和后续可训练的重新排序器的输出,我们在训练ce的损失函数中引入了一个新的秩正则化项。为了在没有金标准注释的情况下重新排序,我们使用机器翻译和投影来自高资源语言的注释引入了多个新的弱标记数据集。结果:xMEN在几种欧洲语言的各种基准数据集上提高了最先进的性能。当目标任务没有训练数据时,弱监督ce是有效的。讨论:我们对规范化误差进行了分析,揭示了复杂实体在规范化方面仍然具有挑战性。新的模块和基准数据集可以很容易地集成在未来。结论:xMEN在许多语言的医疗实体规范化方面表现出很强的性能,即使在没有标记数据和目标语言的术语别名可用的情况下也是如此。为了在将来实现可重复的基准测试,我们将该系统作为开源Python工具包提供。预训练的模型和源代码可在网上获得:https://github.com/hpi-dhc/xmen。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
xMEN: a modular toolkit for cross-lingual medical entity normalization.

Objective: To improve performance of medical entity normalization across many languages, especially when fewer language resources are available compared to English.

Materials and methods: We propose xMEN, a modular system for cross-lingual (x) medical entity normalization (MEN), accommodating both low- and high-resource scenarios. To account for the scarcity of aliases for many target languages and terminologies, we leverage multilingual aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder (CE) model if annotations for the target task are available. To balance the output of general-purpose candidate generators with subsequent trainable re-rankers, we introduce a novel rank regularization term in the loss function for training CEs. For re-ranking without gold-standard annotations, we introduce multiple new weakly labeled datasets using machine translation and projection of annotations from a high-resource language.

Results: xMEN improves the state-of-the-art performance across various benchmark datasets for several European languages. Weakly supervised CEs are effective when no training data is available for the target task.

Discussion: We perform an analysis of normalization errors, revealing that complex entities are still challenging to normalize. New modules and benchmark datasets can be easily integrated in the future.

Conclusion: xMEN exhibits strong performance for medical entity normalization in many languages, even when no labeled data and few terminology aliases for the target language are available. To enable reproducible benchmarks in the future, we make the system available as an open-source Python toolkit. The pre-trained models and source code are available online: https://github.com/hpi-dhc/xmen.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
JAMIA Open
JAMIA Open Medicine-Health Informatics
CiteScore
4.10
自引率
4.80%
发文量
102
审稿时长
16 weeks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信