Enhancing Generalizability in Biomedical Entity Recognition: Self-Attention PCA-CLS Model

IF 3.6 | JCR Q2, Biochemical Research Methods | CAS Tier 3, Biology
Rajesh Kumar Mundotiya;Juhi Priya;Divya Kuwarbi;Teekam Singh
{"title":"Enhancing Generalizability in Biomedical Entity Recognition: Self-Attention PCA-CLS Model","authors":"Rajesh Kumar Mundotiya;Juhi Priya;Divya Kuwarbi;Teekam Singh","doi":"10.1109/TCBB.2024.3429234","DOIUrl":null,"url":null,"abstract":"One of the primary tasks in the early stages of data mining involves the identification of entities from biomedical corpora. Traditional approaches relying on robust feature engineering face challenges when learning from available (un-)annotated data using data-driven models like deep learning-based architectures. Despite leveraging large corpora and advanced deep learning models, domain generalization remains an issue. Attention mechanisms are effective in capturing longer sentence dependencies and extracting semantic and syntactic information from limited annotated datasets. To address out-of-vocabulary challenges in biomedical text, the PCA-CLS (Position and Contextual Attention with CNN-LSTM-Softmax) model combines global self-attention and character-level convolutional neural network techniques. The model's performance is evaluated on eight distinct biomedical domain datasets encompassing entities such as genes, drugs, diseases, and species. The PCA-CLS model outperforms several state-of-the-art models, achieving notable F\n<inline-formula><tex-math>$_{1}$</tex-math></inline-formula>\n-scores, including 88.19% on BC2GM, 85.44% on JNLPBA, 90.80% on BC5CDR-chemical, 87.07% on BC5CDR-disease, 89.18% on BC4CHEMD, 88.81% on NCBI, and 91.59% on the s800 dataset.","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"21 6","pages":"1934-1941"},"PeriodicalIF":3.6000,"publicationDate":"2024-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10599831/","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
Citations: 0

Abstract

One of the primary tasks in the early stages of data mining involves the identification of entities from biomedical corpora. Traditional approaches relying on robust feature engineering face challenges when learning from available (un-)annotated data using data-driven models like deep learning-based architectures. Despite leveraging large corpora and advanced deep learning models, domain generalization remains an issue. Attention mechanisms are effective in capturing longer sentence dependencies and extracting semantic and syntactic information from limited annotated datasets. To address out-of-vocabulary challenges in biomedical text, the PCA-CLS (Position and Contextual Attention with CNN-LSTM-Softmax) model combines global self-attention and character-level convolutional neural network techniques. The model's performance is evaluated on eight distinct biomedical domain datasets encompassing entities such as genes, drugs, diseases, and species. The PCA-CLS model outperforms several state-of-the-art models, achieving notable F$_{1}$-scores, including 88.19% on BC2GM, 85.44% on JNLPBA, 90.80% on BC5CDR-chemical, 87.07% on BC5CDR-disease, 89.18% on BC4CHEMD, 88.81% on NCBI, and 91.59% on the s800 dataset.
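The abstract describes the architecture only at a high level: character-level CNN features combined with word representations, position-aware global self-attention, an LSTM encoder, and a softmax tag classifier. The following is a minimal, hypothetical PyTorch sketch of that kind of pipeline. Every module name, layer size, and hyperparameter here is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of a PCA-CLS-style tagger: char-CNN + word embeddings,
# learned position embeddings, multi-head self-attention, BiLSTM, softmax.
# Sizes and names are assumptions for illustration only.
import torch
import torch.nn as nn


class CharCNN(nn.Module):
    """Character-level CNN producing one feature vector per word."""

    def __init__(self, n_chars, char_dim=32, n_filters=64, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel, padding=kernel // 2)

    def forward(self, char_ids):  # (batch, seq, max_word_len)
        b, s, w = char_ids.shape
        x = self.embed(char_ids.view(b * s, w)).transpose(1, 2)  # (b*s, dim, w)
        x = torch.relu(self.conv(x)).max(dim=2).values           # max-pool over characters
        return x.view(b, s, -1)                                  # (batch, seq, n_filters)


class PCACLSSketch(nn.Module):
    """Word + char features -> position-aware self-attention -> BiLSTM -> tag logits."""

    def __init__(self, n_words, n_chars, n_tags, word_dim=100, hidden=128, max_len=512):
        super().__init__()
        self.word_embed = nn.Embedding(n_words, word_dim, padding_idx=0)
        self.char_cnn = CharCNN(n_chars)
        feat_dim = word_dim + 64
        self.pos_embed = nn.Embedding(max_len, feat_dim)  # learned positional features (assumed)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.lstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids):
        feats = torch.cat([self.word_embed(word_ids), self.char_cnn(char_ids)], dim=-1)
        positions = torch.arange(word_ids.size(1), device=word_ids.device)
        feats = feats + self.pos_embed(positions)      # inject position information
        attended, _ = self.attn(feats, feats, feats)   # global self-attention over the sentence
        encoded, _ = self.lstm(attended)
        return self.classifier(encoded)                # per-token tag logits (softmax applied in the loss)


# Toy usage: 2 sentences, 10 tokens each, 12 characters per token.
model = PCACLSSketch(n_words=5000, n_chars=80, n_tags=9)
words = torch.randint(1, 5000, (2, 10))
chars = torch.randint(1, 80, (2, 10, 12))
logits = model(words, chars)
print(logits.shape)  # torch.Size([2, 10, 9])
```

In such a setup, training would typically minimize a per-token cross-entropy loss over BIO-style tags (e.g., `nn.CrossEntropyLoss` on the flattened logits); the character-level branch is what gives the model a handle on out-of-vocabulary biomedical terms.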
Source journal
CiteScore: 7.50
Self-citation rate: 6.70%
Articles published: 479
Review time: 3 months
Journal description: IEEE/ACM Transactions on Computational Biology and Bioinformatics emphasizes the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology; the development and testing of effective computer programs in bioinformatics; the development of biological databases; important biological results obtained from the use of these methods, programs, and databases; and the emerging field of Systems Biology, where many forms of data are used to create a computer-based model of a complex biological system.