WSHNN: A Weakly Supervised Hybrid Neural Network for the Identification of DNA-Protein Binding Sites.

Wenzheng Bao, Baitong Chen, Yue Zhang
Journal: Current Computer-Aided Drug Design. Publication type: Journal Article. Published: 2024-02-12. DOI: 10.2174/0115734099277249240129114123 (https://doi.org/10.2174/0115734099277249240129114123)

Abstract

Introduction: Transcription factors are vital biological components that control gene expression, and their primary biological function is to recognize DNA sequences. Ongoing research has shown that the specificity of DNA-protein binding plays a significant role in gene expression and regulation, and especially in gene therapy. Convolutional Neural Networks (CNNs) have become increasingly popular for predicting DNA-protein-specific binding sites, but their prediction accuracy still needs to be improved.

Methods: We proposed WSHNN, a framework that combines Multi-Instance Learning (MIL) with a hybrid neural network. First, we used sliding windows to split each DNA sequence into multiple overlapping instances, so that each sequence forms a bag containing multiple instances. Then, the instances were encoded using K-mer encoding. Afterward, a hybrid neural network scored each instance in the same bag separately.
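The windowing and encoding steps can be illustrated with a minimal sketch. It assumes the usual MIL convention (one sequence is a bag, each window is an instance) and maps every K-mer to an integer id; the function names, the window and stride values, and the integer-id scheme are illustrative assumptions, not the authors' implementation.

```python
from itertools import product


def sliding_windows(seq, window, stride):
    """Split one sequence (the bag) into overlapping instances.

    Assumption: trailing bases that do not fill a whole window are dropped.
    """
    if window >= len(seq):
        return [seq]
    return [seq[i:i + window] for i in range(0, len(seq) - window + 1, stride)]


def kmer_index(k, alphabet="ACGT"):
    """Assign an integer id to every possible K-mer, in lexicographic order."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def kmer_encode(instance, k, index):
    """Encode an instance as the ids of its overlapping K-mers (step 1)."""
    return [index[instance[i:i + k]] for i in range(len(instance) - k + 1)]


# Hypothetical toy sequence and parameters, chosen only for illustration.
bag = sliding_windows("ACGTACGTAC", window=6, stride=2)
index = kmer_index(2)
encoded = [kmer_encode(inst, 2, index) for inst in bag]

print(bag)      # ['ACGTAC', 'GTACGT', 'ACGTAC']
print(encoded)  # [[1, 6, 11, 12, 1], [11, 12, 1, 6, 11], [1, 6, 11, 12, 1]]
```

In a full pipeline, each encoded instance would be fed to the scoring network; a one-hot or embedding layer over the K-mer ids is a common next step, though the abstract does not specify which is used.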

Results: Finally, a fully connected network was used to produce the final prediction for that bag. The framework achieved 90.73% precision (Pre), 82.77% recall, 87.17% accuracy (Acc), an F1-score of 0.8657, and a Matthews correlation coefficient (MCC) of 0.7462. In addition, we discussed the performance of K-mer encoding. Compared with other state-of-the-art approaches, the model performs better using sequence information.
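For reference, the five reported metrics follow from the entries of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the toy counts are hypothetical and are not taken from the paper. It also checks that the reported F1-score is consistent with the reported precision and recall.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1, "mcc": mcc}


# Hypothetical counts, for illustration only.
toy = binary_metrics(tp=8, fp=2, tn=7, fn=3)

# Consistency check on the values reported in the abstract:
# F1 is the harmonic mean of precision and recall.
p, r = 0.9073, 0.8277
f1_check = 2 * p * r / (p + r)
print(round(f1_check, 4))  # 0.8657
```

The harmonic mean of the reported precision and recall reproduces the reported F1-score of 0.8657 to four decimal places.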

Conclusion: The experimental results indicate that Bi-directional Long Short-Term Memory (Bi-LSTM) can better capture long-range relationships in DNA sequences (the code and data are available at https://github.com/baowz12345/Weak_Super_Network).
