改进的数据增强方法及其在医学命名实体识别中的应用。

IF 3.3 3区 医学 Q2 MEDICAL INFORMATICS
Hongyu Chen, Li Dan, Yonghe Lu, Minghong Chen, Jinxia Zhang
{"title":"改进的数据增强方法及其在医学命名实体识别中的应用。","authors":"Hongyu Chen, Li Dan, Yonghe Lu, Minghong Chen, Jinxia Zhang","doi":"10.1186/s12911-024-02624-x","DOIUrl":null,"url":null,"abstract":"<p><p>Performing data augmentation in medical named entity recognition (NER) is crucial due to the unique challenges posed by this field. Medical data is characterized by high acquisition costs, specialized terminology, imbalanced distributions, and limited training resources. These factors make achieving high performance in medical NER particularly difficult. Data augmentation methods help to mitigate these issues by generating additional training samples, thus balancing data distribution, enriching the training dataset, and improving model generalization. This paper proposes two data augmentation methods-Contextual Random Replacement based on Word2Vec Augmentation (CRR) and Targeted Entity Random Replacement Augmentation (TER)-aimed at addressing the scarcity and imbalance of data in the medical domain. When combined with a deep learning-based Chinese NER model, these methods can significantly enhance performance and recognition accuracy under limited resources. Experimental results demonstrate that both augmentation methods effectively improve the recognition capability of medical named entities. Specifically, the BERT-BiLSTM-CRF model achieved the highest F1 score of 83.587%, representing a 1.49% increase over the baseline model. This validates the importance and effectiveness of data augmentation in medical NER.</p>","PeriodicalId":9340,"journal":{"name":"BMC Medical Informatics and Decision Making","volume":null,"pages":null},"PeriodicalIF":3.3000,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11302003/pdf/","citationCount":"0","resultStr":"{\"title\":\"An improved data augmentation approach and its application in medical named entity recognition.\",\"authors\":\"Hongyu Chen, Li Dan, Yonghe Lu, Minghong Chen, Jinxia Zhang\",\"doi\":\"10.1186/s12911-024-02624-x\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Performing data augmentation in medical named entity recognition (NER) is crucial due to the unique challenges posed by this field. Medical data is characterized by high acquisition costs, specialized terminology, imbalanced distributions, and limited training resources. These factors make achieving high performance in medical NER particularly difficult. Data augmentation methods help to mitigate these issues by generating additional training samples, thus balancing data distribution, enriching the training dataset, and improving model generalization. This paper proposes two data augmentation methods-Contextual Random Replacement based on Word2Vec Augmentation (CRR) and Targeted Entity Random Replacement Augmentation (TER)-aimed at addressing the scarcity and imbalance of data in the medical domain. When combined with a deep learning-based Chinese NER model, these methods can significantly enhance performance and recognition accuracy under limited resources. Experimental results demonstrate that both augmentation methods effectively improve the recognition capability of medical named entities. Specifically, the BERT-BiLSTM-CRF model achieved the highest F1 score of 83.587%, representing a 1.49% increase over the baseline model. This validates the importance and effectiveness of data augmentation in medical NER.</p>\",\"PeriodicalId\":9340,\"journal\":{\"name\":\"BMC Medical Informatics and Decision Making\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2024-08-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11302003/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Medical Informatics and Decision Making\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1186/s12911-024-02624-x\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"MEDICAL INFORMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Medical Informatics and Decision Making","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1186/s12911-024-02624-x","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
引用次数: 0

摘要

在医学命名实体识别(NER)中进行数据扩增至关重要,因为这一领域面临着独特的挑战。医学数据的特点是获取成本高、术语专业、分布不平衡以及训练资源有限。这些因素使得医疗 NER 实现高性能变得尤为困难。数据增强方法通过生成额外的训练样本来缓解这些问题,从而平衡数据分布、丰富训练数据集和提高模型泛化能力。本文提出了两种数据扩增方法--基于 Word2Vec 的上下文随机替换扩增法(CRR)和目标实体随机替换扩增法(TER),旨在解决医疗领域数据稀缺和不平衡的问题。这些方法与基于深度学习的中文 NER 模型相结合,可以在有限的资源条件下显著提高性能和识别准确率。实验结果表明,这两种增强方法都能有效提高医学命名实体的识别能力。具体来说,BERT-BiLSTM-CRF 模型的 F1 分数最高,达到 83.587%,比基线模型提高了 1.49%。这验证了数据增强在医学 NER 中的重要性和有效性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
An improved data augmentation approach and its application in medical named entity recognition.

Performing data augmentation in medical named entity recognition (NER) is crucial due to the unique challenges posed by this field. Medical data is characterized by high acquisition costs, specialized terminology, imbalanced distributions, and limited training resources. These factors make achieving high performance in medical NER particularly difficult. Data augmentation methods help to mitigate these issues by generating additional training samples, thus balancing data distribution, enriching the training dataset, and improving model generalization. This paper proposes two data augmentation methods-Contextual Random Replacement based on Word2Vec Augmentation (CRR) and Targeted Entity Random Replacement Augmentation (TER)-aimed at addressing the scarcity and imbalance of data in the medical domain. When combined with a deep learning-based Chinese NER model, these methods can significantly enhance performance and recognition accuracy under limited resources. Experimental results demonstrate that both augmentation methods effectively improve the recognition capability of medical named entities. Specifically, the BERT-BiLSTM-CRF model achieved the highest F1 score of 83.587%, representing a 1.49% increase over the baseline model. This validates the importance and effectiveness of data augmentation in medical NER.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
7.20
自引率
5.70%
发文量
297
审稿时长
1 months
期刊介绍: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信