MedNER: Enhanced Named Entity Recognition in Medical Corpus via Optimized Balanced and Deep Active Learning

IF 7.2 | Zone 4 (Computer Science) | JCR Q1: Computer Science, Artificial Intelligence
Zhuang Yan, Junyan Zhang, Ruogu Lu, Kunlun He, Xiuxing Li
DOI: 10.1145/3678178 (https://doi.org/10.1145/3678178)
Journal: ACM Transactions on Intelligent Systems and Technology
Published: 2024-07-17 (Journal Article)
Citations: 0

Abstract

MedNER: Enhanced Named Entity Recognition in Medical Corpus via Optimized Balanced and Deep Active Learning
Ever-growing electronic medical corpora give researchers unprecedented opportunities to analyze patient conditions and drug effects. At the same time, processing electronic medical records at scale raises severe challenges. First, emerging medical terms, including informal descriptions, are difficult to recognize. Second, although deep models can help extract entities from medical texts, they require large-scale labels, which are time-intensive to obtain and not always available in the medical domain. As a result, when many unseen concepts appear or labeled data is insufficient, the performance of existing algorithms declines sharply. In this paper, we propose a balanced and deep active learning framework (MedNER) for Named Entity Recognition in medical corpora to alleviate these problems. Specifically, to describe our selection strategy precisely, we first define the uncertainty of a medical sentence as the labeling loss predicted by a loss-prediction module, and define diversity as the smallest text distance between pairs of sentences in a sample batch, computed from word-morpheme embeddings. Furthermore, to trade off uncertainty against diversity, we formulate a Distinct-K optimization problem that maximizes the minimum uncertainty and diversity of the chosen sentences. Finally, we propose a threshold-based approximate selection algorithm, Distinct-K Filter, which selects the most beneficial training samples by balancing diversity and uncertainty. Extensive experimental results on real datasets demonstrate that MedNER significantly outperforms existing approaches.
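The abstract does not spell out the Distinct-K Filter, so the following is only a rough sketch of what a threshold-based selection balancing uncertainty and diversity might look like, not the authors' algorithm. It assumes per-sentence uncertainty scores are already available (in the paper they come from a loss-prediction module) and uses Euclidean distance between fixed sentence vectors as a stand-in for the text distance over word-morpheme embeddings. The function name `distinct_k_filter_sketch` and the thresholds `tau_u` and `tau_d` are hypothetical.

```python
import numpy as np


def distinct_k_filter_sketch(uncertainty, embeddings, k, tau_u, tau_d):
    """Greedily pick up to k sentences (illustrative assumption, not the paper's method).

    A sentence is kept only if its uncertainty is at least tau_u and its
    distance to every already chosen sentence is at least tau_d.
    """
    uncertainty = np.asarray(uncertainty, dtype=float)
    embeddings = np.asarray(embeddings, dtype=float)

    # Visit candidates from most to least uncertain (stable for ties).
    order = np.argsort(-uncertainty, kind="stable")

    chosen = []
    for i in order:
        if len(chosen) == k:
            break
        if uncertainty[i] < tau_u:
            # All remaining candidates are even less uncertain.
            break
        if chosen:
            # Distance from candidate i to each selected sentence.
            dists = np.linalg.norm(embeddings[chosen] - embeddings[i], axis=1)
            if dists.min() < tau_d:
                continue  # too close to something already selected
        chosen.append(int(i))
    return chosen


# Toy data: sentences 0 and 1 are near-duplicates; sentence 3 is low-uncertainty.
scores = [0.9, 0.8, 0.7, 0.1]
vectors = [[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [5.0, 5.0]]
batch = distinct_k_filter_sketch(scores, vectors, k=2, tau_u=0.5, tau_d=1.0)
print(batch)
```

In this toy run the near-duplicate sentence 1 is skipped in favor of sentence 2, so the selected batch stays both uncertain and spread out. How the thresholds would be set, and how the approximation relates to the max-min objective, is left to the paper itself.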
Source journal
ACM Transactions on Intelligent Systems and Technology
Categories: Computer Science, Artificial Intelligence; Computer Science, Information Systems
CiteScore: 9.30
Self-citation rate: 2.00%
Articles published: 131
Journal description: ACM Transactions on Intelligent Systems and Technology is a scholarly journal that publishes the highest quality papers on intelligent systems, applicable algorithms and technology with a multi-disciplinary perspective. An intelligent system is one that uses artificial intelligence (AI) techniques to offer important services (e.g., as a component of a larger system) to allow integrated systems to perceive, reason, learn, and act intelligently in the real world. ACM TIST publishes six issues a year. Each issue has 8-11 regular papers, with around 20 published journal pages or 10,000 words per paper. Additional references, proofs, graphs or detailed experiment results can be submitted as a separate appendix, while excessively lengthy papers will be rejected automatically. Authors can include online-only appendices for additional content of their published papers and are encouraged to share their code and/or data with other readers.