Biomedical named entity recognition through improved balanced undersampling for addressing class imbalance and preserving contextual information
S. M. Archana, Jay Prakash
International Journal of Information Technology (Journal Article), published 2024-08-20
DOI: https://doi.org/10.1007/s41870-024-02137-w
Citation count: 0
Abstract
Biomedical Named Entity Recognition (Bio-NER) identifies and categorises named entities in biomedical text, such as diseases, chemicals, proteins, and genes. Because most biomedical data originates from the real world, the majority of data instances do not pertain to the named entity of interest. This class imbalance degrades the performance of machine-learning-based Bio-NER models, whose learning objective is usually dominated by the majority of non-entity tokens. Various undersampling techniques have been introduced to address this issue. Balanced Undersampling (BUS) is one such approach; it operates at the sentence level to enhance Bio-NER. However, BUS does not preserve contextual information during the undersampling procedure. To overcome this limitation, we introduce an improved Balanced Undersampling method (iBUS) for Bio-NER. During undersampling, iBUS considers the importance of individual instances and generates a balanced dataset while retaining essential instances. To validate the effectiveness of the proposed method against competitive methods, we perform experiments on the NCBI disease, CHEMDNER, and BC5CDR chemical datasets. The experimental results show that the proposed method achieves a higher F1 score than competitive approaches.
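To make the idea of sentence-level undersampling concrete, the following is a minimal sketch under stated assumptions. It is not the BUS or iBUS algorithm, whose details the abstract does not give. It assumes BIO-style tags with "O" marking non-entity tokens, keeps every sentence that contains an entity, and randomly retains only a fraction of the entity-free sentences. The names `has_entity`, `non_entity_ratio`, `undersample_sentences`, and `keep_fraction` are hypothetical and chosen for illustration.

```python
import random
from typing import List, Tuple

# A sentence is a list of (token, tag) pairs; "O" marks a non-entity token.
# This tagging convention is an assumption, not taken from the paper.
Sentence = List[Tuple[str, str]]


def has_entity(sentence: Sentence) -> bool:
    """Return True if at least one token carries an entity tag."""
    return any(tag != "O" for _, tag in sentence)


def non_entity_ratio(sentences: List[Sentence]) -> float:
    """Fraction of all tokens tagged "O" (a simple imbalance measure)."""
    tags = [tag for sentence in sentences for _, tag in sentence]
    return sum(tag == "O" for tag in tags) / len(tags) if tags else 0.0


def undersample_sentences(
    sentences: List[Sentence], keep_fraction: float = 0.5, seed: int = 0
) -> List[Sentence]:
    """Keep all entity-bearing sentences and a random subset of the rest.

    Generic random sentence-level undersampling, used here only as an
    illustration; it does not weigh instance importance as iBUS is
    described as doing.
    """
    rng = random.Random(seed)
    # Indices of sentences that contain no entity token at all.
    entity_free = [i for i, s in enumerate(sentences) if not has_entity(s)]
    n_keep = int(len(entity_free) * keep_fraction)
    kept_free = set(rng.sample(entity_free, n_keep))
    # Preserve the original sentence order in the output.
    return [
        s for i, s in enumerate(sentences) if has_entity(s) or i in kept_free
    ]
```

Because whole sentences are kept or dropped, every retained token keeps its in-sentence neighbours. Context that spans sentences can still be lost, which may be the kind of limitation the abstract attributes to BUS.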