MedNER: Enhanced Named Entity Recognition in Medical Corpus via Optimized Balanced and Deep Active Learning

IF 4.4, Zone 2 (Chemistry), Q2 MATERIALS SCIENCE, MULTIDISCIPLINARY

ACS Applied Polymer Materials Pub Date : 2024-07-17 DOI:10.1145/3678178

Zhuang Yan, Junyan Zhang, Ruogu Lu, Kunlun He, Xiuxing Li


Citations: 0

Abstract



Ever-growing electronic medical corpora give researchers unprecedented opportunities to analyze patient conditions and drug effects. At the same time, processing electronic medical records at scale raises severe challenges. First, emerging medical terms, including informal descriptions, are difficult to recognize. Second, although deep models can help extract entities from medical text, they require large numbers of labels, which are time-consuming to obtain and not always available in the medical domain. As a result, when many unseen concepts appear or labeled data is insufficient, the performance of existing algorithms declines sharply. In this paper, we propose a balanced and deep active learning framework (MedNER) for Named Entity Recognition in medical corpora to alleviate these problems. Specifically, to describe our selection strategy precisely, we first define the uncertainty of a medical sentence as the labeling loss predicted by a loss-prediction module, and define diversity as the smallest text distance between pairs of sentences in a sample batch, computed from word-morpheme embeddings. Furthermore, to trade off uncertainty against diversity, we formulate a Distinct-K optimization problem that maximizes the minimum uncertainty and diversity of the chosen sentences. Finally, we propose a threshold-based approximate selection algorithm, Distinct-K Filter, which selects the most beneficial training samples by balancing diversity and uncertainty. Extensive experiments on real datasets demonstrate that MedNER significantly outperforms existing approaches.
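The selection idea in the abstract can be illustrated with a small sketch. The abstract does not give the details of Distinct-K Filter, so this is not the authors' algorithm. It is only a generic greedy threshold filter, written under several assumptions. Uncertainty scores are assumed to be already available, standing in for the output of the loss-prediction module. Sentence embeddings are plain vectors compared by Euclidean distance, standing in for the text distance over word-morpheme embeddings. The names `threshold_select`, `min_pairwise_distance`, `k`, and `tau` are hypothetical.

```python
import numpy as np


def min_pairwise_distance(embeddings):
    """Smallest Euclidean distance between any two rows.

    Used here as a stand-in for the batch diversity described in the
    abstract (assumption: Euclidean distance on sentence vectors).
    """
    n = len(embeddings)
    if n < 2:
        return float("inf")
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Only the strict upper triangle holds distinct unordered pairs.
    return float(dists[np.triu_indices(n, k=1)].min())


def threshold_select(uncertainty, embeddings, k, tau):
    """Greedy threshold filter (illustrative, not the paper's method).

    Visits sentences from most to least uncertain and keeps a sentence
    only if it lies at least `tau` away from every sentence already
    kept. Stops once `k` sentences are chosen.
    """
    order = np.argsort(-uncertainty, kind="stable")
    chosen = []
    for i in order:
        far_enough = all(
            np.linalg.norm(embeddings[i] - embeddings[j]) >= tau
            for j in chosen
        )
        if far_enough:
            chosen.append(int(i))
        if len(chosen) == k:
            break
    return chosen


# Toy data: sentences 0 and 1 are near-duplicates, 2 and 3 are far away.
uncertainty = np.array([0.9, 0.8, 0.7, 0.2])
embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 0.0], [0.0, 4.0]])

batch = threshold_select(uncertainty, embeddings, k=2, tau=1.0)
print(batch)                                     # [0, 2]
print(min_pairwise_distance(embeddings[batch]))  # 3.0
```

In this toy run, sentence 1 is skipped despite its high uncertainty because it sits within `tau` of sentence 0, so the batch keeps both a high minimum uncertainty and a high minimum pairwise distance. A larger `tau` favors diversity and a smaller one favors uncertainty.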


Source journal: ACS Applied Polymer Materials

CiteScore: 7.20

Self-citation rate: 6.00%

Articles published: 810

About the journal: ACS Applied Polymer Materials is an interdisciplinary journal publishing original research covering all aspects of engineering, chemistry, physics, and biology relevant to applications of polymers. The journal is devoted to reports of new and original experimental and theoretical research of an applied nature that integrates fundamental knowledge in the areas of materials, engineering, physics, bioscience, polymer science, and chemistry into important polymer applications. The journal is specifically interested in work that addresses relationships among structure, processing, morphology, chemistry, properties, and function, as well as work that provides insights into mechanisms critical to the performance of the polymer for applications.