Hybrid deep learning models for text-based identification of gene-disease associations.

IF 2.2 | Zone 4, Engineering and Technology | JCR Q3, Pharmacology & Pharmacy

Bioimpacts Pub Date : 2025-06-28 eCollection Date: 2025-01-01 DOI:10.34172/bi.31226

Noor Fadhil Jumaa, Jafar Razmara, Sepideh Parvizpour, Jaber Karimpour



Abstract

Introduction: Identifying gene-disease associations is crucial for advancing medical research and improving clinical outcomes. Nevertheless, the rapid expansion of biomedical literature poses significant obstacles to extracting meaningful relationships from extensive text collections.

Methods: This study applies deep learning to automate this process, using three publicly available datasets (EU-ADR, GAD, and SNPPhenA) to classify these associations. Each dataset underwent rigorous pre-processing, including entity identification and preparation, word embedding using pre-trained Word2Vec and fastText models, and position embedding to capture semantic and contextual relationships within the text. Three deep learning-based hybrid models were implemented and compared: CNN-LSTM, CNN-GRU, and CNN-GRU-LSTM. Each model was equipped with an attention mechanism to enhance its performance.
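The position-embedding step mentioned above can be illustrated with a minimal Python sketch. It is not taken from the authors' repository. It assumes the common relation-extraction convention in which each token receives two integer features, its clipped distance to the gene mention and to the disease mention, which are later looked up in trainable position-embedding tables. The function name `relative_positions`, the `max_dist` parameter, the token spans, and the example sentence are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def relative_positions(n_tokens: int, span: Tuple[int, int], max_dist: int) -> List[int]:
    """Return one non-negative position index per token.

    `span` is a (start, end) pair of token offsets for an entity mention,
    with `end` exclusive. Tokens inside the mention get distance 0; tokens
    before it get negative distances and tokens after it positive ones.
    Distances are clipped to [-max_dist, max_dist] and shifted by max_dist
    so they can index an embedding table with 2 * max_dist + 1 rows.
    """
    start, end = span
    indices = []
    for i in range(n_tokens):
        if i < start:
            dist = i - start          # token lies left of the mention
        elif i >= end:
            dist = i - (end - 1)      # token lies right of the mention
        else:
            dist = 0                  # token is part of the mention
        dist = max(-max_dist, min(max_dist, dist))
        indices.append(dist + max_dist)
    return indices


# Illustrative sentence (made up, not drawn from EU-ADR, GAD, or SNPPhenA).
tokens = "Mutations in BRCA1 increase the risk of breast cancer".split()
gene_span = (2, 3)      # "BRCA1"
disease_span = (7, 9)   # "breast cancer"

gene_pos = relative_positions(len(tokens), gene_span, max_dist=4)
disease_pos = relative_positions(len(tokens), disease_span, max_dist=4)

for tok, g, d in zip(tokens, gene_pos, disease_pos):
    print(f"{tok:10s} gene_idx={g} disease_idx={d}")
```

In a model of the kind described here, each index would typically select a row of a small embedding matrix, and the resulting vectors would be concatenated with the Word2Vec or fastText vector of the same token before the convolutional layer. Whether the paper uses exactly this concatenation is an assumption.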

Results: Our findings reveal that the CNN-GRU model achieved the highest accuracy of 91.23% on the SNPPhenA dataset, while the CNN-GRU-LSTM model attained an accuracy of 90.14% on the EU-ADR dataset. Meanwhile, the CNN-LSTM model demonstrated superior performance on the GAD dataset, achieving an accuracy of 84.90%. Compared to previous state-of-the-art methods, such as BioBERT-based models, our hybrid approach demonstrates superior classification performance by effectively capturing local and sequential features without relying on heavy pre-training.

Conclusion: The developed models and their evaluation data are available at https://github.com/NoorFadhil/Deep-GDAE.


Source journal

Bioimpacts (Pharmacology, Toxicology and Pharmaceutics: Pharmaceutical Science)

CiteScore: 4.80
Self-citation rate: 7.70%
Review time: 5 weeks

Journal description: BioImpacts (BI) is a peer-reviewed, multidisciplinary international journal covering original research articles, reviews, commentaries, hypotheses, methodologies, and visions/reflections dealing with all aspects of biological and biomedical research at the molecular, cellular, functional, and translational levels.