Enhancing Generalizability in Biomedical Entity Recognition: Self-Attention PCA-CLS Model

IF 3.6 | JCR Q2, Biochemical Research Methods | CAS Tier 3, Biology
Rajesh Kumar Mundotiya;Juhi Priya;Divya Kuwarbi;Teekam Singh
{"title":"Enhancing Generalizability in Biomedical Entity Recognition: Self-Attention PCA-CLS Model","authors":"Rajesh Kumar Mundotiya;Juhi Priya;Divya Kuwarbi;Teekam Singh","doi":"10.1109/TCBB.2024.3429234","DOIUrl":null,"url":null,"abstract":"One of the primary tasks in the early stages of data mining involves the identification of entities from biomedical corpora. Traditional approaches relying on robust feature engineering face challenges when learning from available (un-)annotated data using data-driven models like deep learning-based architectures. Despite leveraging large corpora and advanced deep learning models, domain generalization remains an issue. Attention mechanisms are effective in capturing longer sentence dependencies and extracting semantic and syntactic information from limited annotated datasets. To address out-of-vocabulary challenges in biomedical text, the PCA-CLS (Position and Contextual Attention with CNN-LSTM-Softmax) model combines global self-attention and character-level convolutional neural network techniques. The model's performance is evaluated on eight distinct biomedical domain datasets encompassing entities such as genes, drugs, diseases, and species. The PCA-CLS model outperforms several state-of-the-art models, achieving notable F\n<inline-formula><tex-math>$_{1}$</tex-math></inline-formula>\n-scores, including 88.19% on BC2GM, 85.44% on JNLPBA, 90.80% on BC5CDR-chemical, 87.07% on BC5CDR-disease, 89.18% on BC4CHEMD, 88.81% on NCBI, and 91.59% on the s800 dataset.","PeriodicalId":13344,"journal":{"name":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","volume":"21 6","pages":"1934-1941"},"PeriodicalIF":3.6000,"publicationDate":"2024-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Computational Biology and Bioinformatics","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10599831/","RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"BIOCHEMICAL RESEARCH METHODS","Score":null,"Total":0}
Citations: 0

Abstract

One of the primary tasks in the early stages of data mining involves the identification of entities from biomedical corpora. Traditional approaches relying on robust feature engineering face challenges when learning from available (un-)annotated data using data-driven models like deep learning-based architectures. Despite leveraging large corpora and advanced deep learning models, domain generalization remains an issue. Attention mechanisms are effective in capturing longer sentence dependencies and extracting semantic and syntactic information from limited annotated datasets. To address out-of-vocabulary challenges in biomedical text, the PCA-CLS (Position and Contextual Attention with CNN-LSTM-Softmax) model combines global self-attention and character-level convolutional neural network techniques. The model's performance is evaluated on eight distinct biomedical domain datasets encompassing entities such as genes, drugs, diseases, and species. The PCA-CLS model outperforms several state-of-the-art models, achieving notable F$_{1}$-scores, including 88.19% on BC2GM, 85.44% on JNLPBA, 90.80% on BC5CDR-chemical, 87.07% on BC5CDR-disease, 89.18% on BC4CHEMD, 88.81% on NCBI, and 91.59% on the s800 dataset.
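The abstract describes the architecture only at a high level: character-level CNN features combined with word representations, position-aware global self-attention, an LSTM encoder, and a softmax tag classifier. The following is a minimal, hypothetical PyTorch sketch of that kind of pipeline. Every module name, layer size, and hyperparameter here is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of a PCA-CLS-style tagger: char-CNN + word embeddings,
# learned position embeddings, multi-head self-attention, BiLSTM, softmax.
# Sizes and names are assumptions for illustration only.
import torch
import torch.nn as nn


class CharCNN(nn.Module):
    """Character-level CNN producing one feature vector per word."""

    def __init__(self, n_chars, char_dim=32, n_filters=64, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel, padding=kernel // 2)

    def forward(self, char_ids):  # (batch, seq, max_word_len)
        b, s, w = char_ids.shape
        x = self.embed(char_ids.view(b * s, w)).transpose(1, 2)  # (b*s, dim, w)
        x = torch.relu(self.conv(x)).max(dim=2).values           # max-pool over characters
        return x.view(b, s, -1)                                  # (batch, seq, n_filters)


class PCACLSSketch(nn.Module):
    """Word + char features -> position-aware self-attention -> BiLSTM -> tag logits."""

    def __init__(self, n_words, n_chars, n_tags, word_dim=100, hidden=128, max_len=512):
        super().__init__()
        self.word_embed = nn.Embedding(n_words, word_dim, padding_idx=0)
        self.char_cnn = CharCNN(n_chars)
        feat_dim = word_dim + 64
        self.pos_embed = nn.Embedding(max_len, feat_dim)  # learned positional features (assumed)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.lstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_tags)

    def forward(self, word_ids, char_ids):
        feats = torch.cat([self.word_embed(word_ids), self.char_cnn(char_ids)], dim=-1)
        positions = torch.arange(word_ids.size(1), device=word_ids.device)
        feats = feats + self.pos_embed(positions)      # inject position information
        attended, _ = self.attn(feats, feats, feats)   # global self-attention over the sentence
        encoded, _ = self.lstm(attended)
        return self.classifier(encoded)                # per-token tag logits (softmax applied in the loss)


# Toy usage: 2 sentences, 10 tokens each, 12 characters per token.
model = PCACLSSketch(n_words=5000, n_chars=80, n_tags=9)
words = torch.randint(1, 5000, (2, 10))
chars = torch.randint(1, 80, (2, 10, 12))
logits = model(words, chars)
print(logits.shape)  # torch.Size([2, 10, 9])
```

In such a setup, training would typically minimize a per-token cross-entropy loss over BIO-style tags (e.g., `nn.CrossEntropyLoss` on the flattened logits); the character-level branch is what gives the model a handle on out-of-vocabulary biomedical terms.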
Source journal
CiteScore: 7.50
Self-citation rate: 6.70%
Articles published: 479
Review time: 3 months
Journal description: IEEE/ACM Transactions on Computational Biology and Bioinformatics emphasizes the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology; the development and testing of effective computer programs in bioinformatics; the development of biological databases; important biological results obtained from the use of these methods, programs, and databases; and the emerging field of Systems Biology, where many forms of data are used to create a computer-based model of a complex biological system.