BMC Bioinformatics: Latest Articles

Rapid bacterial identification through volatile organic compound analysis and deep learning.
IF 2.9 | Zone 3 (Biology)
BMC Bioinformatics | Pub Date: 2024-11-06 | DOI: 10.1186/s12859-024-05967-4
Bowen Yan, Lin Zeng, Yanyi Lu, Min Li, Weiping Lu, Bangfu Zhou, Qinghua He
BMC Bioinformatics 25(1): 347 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11539783/pdf/
Background: The increasing antimicrobial resistance caused by the improper use of antibiotics poses a significant challenge to humanity. Rapid and accurate identification of microbial species in clinical settings is crucial for precise medication and for slowing the development of antimicrobial resistance. This study aimed to explore a method for automatic identification of bacteria using Volatile Organic Compounds (VOCs) analysis and deep learning algorithms.
Results: AlexNet with augmentation applied produced the best results. The average accuracy for single bacterial culture classification reached 99.24% under cross-validation, and the accuracies for identifying the three bacteria in randomly mixed cultures were SA: 98.6%, EC: 98.58%, and PA: 98.99%.
Conclusion: This work provides a new approach to quickly identify bacterial microorganisms. The method can automatically identify bacteria in GC-IMS detection results, helping clinicians quickly detect bacterial species and prescribe medication accurately, thereby controlling epidemics and minimizing the negative impact of bacterial resistance on society.
Citations: 0
Prediction of antibody-antigen interaction based on backbone aware with invariant point attention.
IF 2.9 | Zone 3 (Biology)
BMC Bioinformatics | Pub Date: 2024-11-06 | DOI: 10.1186/s12859-024-05961-w
Miao Gu, Weiyang Yang, Min Liu
BMC Bioinformatics 25(1): 348 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11542381/pdf/
Background: Antibodies play a crucial role in disease treatment, leveraging their ability to selectively interact with a specific antigen. However, screening antibody gene sequences for target antigens via biological experiments is extremely time-consuming and labor-intensive. Several computational methods have been developed to predict antibody-antigen interaction, but they do not characterize the underlying structure of the antibody.
Results: Benefiting from recent breakthroughs in deep learning for antibody structure prediction, we propose a novel neural network architecture to predict antibody-antigen interaction. We first introduce AbAgIPA: an antibody structure prediction network to obtain the antibody backbone structure, where the structural features of antibodies and antigens are encoded into representation vectors according to the amino acid physicochemical features and Invariant Point Attention (IPA) computation methods. Finally, the antibody-antigen interaction is predicted by global max pooling, feature concatenation, and a fully connected layer. We evaluated our method on antigen diversity and antigen-specific antibody-antigen interaction datasets. Additionally, our model exhibits a commendable level of interpretability, essential for understanding underlying interaction mechanisms.
Conclusions: Quantitative experimental results demonstrate that the new neural network architecture significantly outperforms the best sequence-based methods as well as the methods based on residue contact maps and graph convolution networks (GCNs). The source code is freely available on GitHub at https://github.com/gmthu66/AbAgIPA .
Citations: 0
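The final prediction step named in the AbAgIPA abstract (global max pooling, feature concatenation, a fully connected layer) can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function names, the two-dimensional toy features, the weights, and the sigmoid output are all assumptions made for the example.

```python
import math

def global_max_pool(residue_features):
    """Column-wise maximum over a list of per-residue feature vectors."""
    return [max(column) for column in zip(*residue_features)]

def predict_interaction(antibody_feats, antigen_feats, weights, bias):
    """Pool each molecule, concatenate, apply one dense layer with a sigmoid."""
    pooled = global_max_pool(antibody_feats) + global_max_pool(antigen_feats)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical toy inputs: 3 antibody residues and 2 antigen residues,
# each described by a 2-dimensional feature vector.
antibody = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]
antigen = [[0.7, 0.0], [0.2, 0.6]]
weights = [1.0, -1.0, 0.5, 0.5]  # hypothetical, untrained weights
bias = 0.0

score = predict_interaction(antibody, antigen, weights, bias)
print(round(score, 4))
```

Max pooling makes the pooled vector independent of sequence length, which is why molecules of different sizes can feed the same fixed-size dense layer.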
REDalign: accurate RNA structural alignment using residual encoder-decoder network.
IF 2.9 | Zone 3 (Biology)
BMC Bioinformatics | Pub Date: 2024-11-05 | DOI: 10.1186/s12859-024-05956-7
Chun-Chi Chen, Yi-Ming Chan, Hyundoo Jeong
BMC Bioinformatics 25(1): 346 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11539752/pdf/
Background: RNA secondary structural alignment serves as a foundational procedure in identifying conserved structural motifs among RNA sequences, crucially advancing our understanding of novel RNAs via comparative genomic analysis. While various computational strategies for RNA structural alignment exist, they often come with high computational complexity. Specifically, when addressing a set of RNAs with unknown structures, the task of simultaneously predicting their consensus secondary structure and determining the optimal sequence alignment requires an overwhelming computational effort of $O(L^6)$ for each RNA pair. Such an extremely high computational complexity makes these methods impractical for large-scale analysis despite their accurate alignment capabilities.
Results: In this paper, we introduce REDalign, an innovative approach based on deep learning for RNA secondary structural alignment. By utilizing a residual encoder-decoder network, REDalign can efficiently capture consensus structures and optimize structural alignments. In this learning model, the encoder network leverages a hierarchical pyramid to assimilate high-level structural features. Concurrently, the decoder network, enhanced with residual skip connections, integrates multi-level encoded features to learn detailed feature hierarchies with fewer parameter sets. REDalign significantly reduces computational complexity compared to Sankoff-style algorithms and effectively handles non-nested structures, including pseudoknots, which are challenging for traditional alignment methods. Extensive evaluations demonstrate that REDalign provides superior accuracy and substantial computational efficiency.
Conclusion: REDalign presents a significant advancement in RNA secondary structural alignment, balancing high alignment accuracy with lower computational demands. Its ability to handle complex RNA structures, including pseudoknots, makes it an effective tool for large-scale RNA analysis, with potential implications for accelerating discoveries in RNA research and comparative genomics.
Citations: 0
PangeBlocks: customized construction of pangenome graphs via maximal blocks.
IF 2.9 | Zone 3 (Biology)
BMC Bioinformatics | Pub Date: 2024-11-04 | DOI: 10.1186/s12859-024-05958-5
Jorge Avila Cartes, Paola Bonizzoni, Simone Ciccolella, Gianluca Della Vedova, Luca Denti
BMC Bioinformatics 25(1): 344 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11533710/pdf/
Background: The construction of a pangenome graph is a fundamental task in pangenomics. A natural theoretical question is how to formalize the computational problem of building an optimal pangenome graph, making explicit the underlying optimization criterion and the set of feasible solutions. Current approaches build a pangenome graph with heuristics, without assuming explicit optimization criteria. Thus it is unclear how a specific optimization criterion affects the graph topology and downstream analysis, such as read mapping and variant calling.
Results: In this paper, by leveraging the notion of maximal block in a Multiple Sequence Alignment (MSA), we reframe the pangenome graph construction problem as an exact cover problem on blocks called Minimum Weighted Block Cover (MWBC). Then we propose an Integer Linear Programming (ILP) formulation for the MWBC problem that allows us to study the most natural objective functions for building a graph. We provide an implementation of the ILP approach for solving the MWBC and we evaluate it on SARS-CoV-2 complete genomes, showing how different objective functions lead to pangenome graphs that have different properties, hinting that the specific downstream task can drive the graph construction phase.
Conclusion: We show that a customized construction of a pangenome graph based on selecting objective functions has a direct impact on the resulting graphs. In particular, our formalization of the MWBC problem, based on finding an optimal subset of blocks covering an MSA, paves the way to novel practical approaches to graph representations of an MSA where the user can guide the construction.
Citations: 0
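To make the exact-cover framing concrete, the sketch below selects a minimum-weight set of blocks that covers every cell of a tiny alignment exactly once. It swaps the paper's ILP for plain brute-force enumeration, which only works for toy inputs, and the alignment, the candidate blocks, and both weight schemes are hypothetical choices for illustration rather than the objective functions studied by the authors.

```python
from itertools import combinations

def block_cells(rows, start, end):
    """Cells (row, column) covered by a block over columns start..end inclusive."""
    return frozenset((r, c) for r in rows for c in range(start, end + 1))

def min_weight_exact_cover(universe, blocks, weights):
    """Enumerate block subsets; return (cost, indices) of the cheapest exact cover."""
    best_cost, best_choice = None, None
    for k in range(1, len(blocks) + 1):
        for choice in combinations(range(len(blocks)), k):
            covered = [cell for i in choice for cell in blocks[i]]
            # Exact cover: no cell is covered twice and no cell is missed.
            if len(covered) == len(universe) and set(covered) == universe:
                cost = sum(weights[i] for i in choice)
                if best_cost is None or cost < best_cost:
                    best_cost, best_choice = cost, choice
    return best_cost, best_choice

# Toy alignment with two rows: "ACG" and "ATG".
universe = block_cells([0, 1], 0, 2)
blocks = [
    block_cells([0, 1], 0, 0),  # 0: shared "A"
    block_cells([0, 1], 2, 2),  # 1: shared "G"
    block_cells([0], 1, 1),     # 2: "C" in row 0
    block_cells([1], 1, 1),     # 3: "T" in row 1
    block_cells([0], 0, 2),     # 4: whole row 0
    block_cells([1], 0, 2),     # 5: whole row 1
]

fewest_blocks = min_weight_exact_cover(universe, blocks, [1] * 6)
shortest_labels = min_weight_exact_cover(universe, blocks, [1, 1, 1, 1, 3, 3])
print(fewest_blocks, shortest_labels)
```

With unit weights the cheapest cover keeps each sequence as one block, while weighting blocks by label length favors the cover that merges the shared columns. The two objectives therefore produce different graphs from the same alignment, which is the effect the abstract describes.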
GPCR-BSD: a database of binding sites of human G-protein coupled receptors under diverse states.
IF 2.9 | Zone 3 (Biology)
BMC Bioinformatics | Pub Date: 2024-11-04 | DOI: 10.1186/s12859-024-05962-9
Fan Liu, Han Zhou, Xiaonong Li, Liangliang Zhou, Chungong Yu, Haicang Zhang, Dongbo Bu, Xinmiao Liang
BMC Bioinformatics 25(1): 343 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11533411/pdf/
G-protein coupled receptors (GPCRs), the largest family of membrane proteins in the human body, are involved in a great variety of biological processes and have thus become highly valuable drug targets. By binding with ligands (e.g., drugs), GPCRs switch between active and inactive conformational states, thereby performing functions such as signal transmission. The changes in binding pockets under different states are important for a better understanding of drug-target interactions. Therefore it is critical, as well as a practical need, to obtain binding sites in human GPCR structures. We report a database (called GPCR-BSD) that collects 127,990 predicted binding sites of 803 GPCRs under active and inactive states (thus 1,606 structures in total). The binding sites were identified from the predicted GPCR structures by executing three geometry-based pocket prediction methods: fpocket, CavityPlus, and GHECOM. The server provides query, visualization, and comparison of the predicted binding sites for both predicted GPCR structures and experimentally determined structures recorded in PDB. We evaluated the identified pockets of 132 experimentally determined human GPCR structures in terms of pocket residue coverage, pocket center distance, and redocking accuracy. The evaluation showed that the fpocket and CavityPlus methods performed better and successfully predicted orthosteric binding sites in over 60% of the 132 experimentally determined structures. The GPCR Binding Site database is freely accessible at https://gpcrbs.bigdata.jcmsc.cn . This study not only provides a systematic evaluation of the commonly used fpocket and CavityPlus methods for the first time but also meets the need for binding site information in GPCR studies.
Citations: 0
MIPPIS: protein-protein interaction site prediction network with multi-information fusion.
IF 2.9 | Zone 3 (Biology)
BMC Bioinformatics | Pub Date: 2024-11-04 | DOI: 10.1186/s12859-024-05964-7
Shuang Wang, Kaiyu Dong, Dingming Liang, Yunjing Zhang, Xue Li, Tao Song
BMC Bioinformatics 25(1): 345 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11536593/pdf/
Background: The prediction of protein-protein interaction sites plays a crucial role in biochemical processes. Investigating the interaction between viruses and receptor proteins through biological techniques aids in understanding disease mechanisms and guides the development of corresponding drugs. While various methods have been proposed in the past, they often suffer from drawbacks such as long processing times, high costs, and low accuracy.
Results: Addressing these challenges, we propose a novel protein-protein interaction site prediction network based on multi-information fusion. In our approach, the initial amino acid features are depicted by the position-specific scoring matrix, hidden Markov model, dictionary of protein secondary structure, and one-hot encoding. Simultaneously, we adopt a multi-channel approach to extract deep-level amino acid features from different perspectives. The graph convolutional network channel effectively extracts spatial structural information. The bidirectional long short-term memory channel treats the amino acid sequence as natural language, capturing the protein's primary structure information. The ProtT5 protein large language model channel outputs a more comprehensive amino acid embedding representation, providing a robust complement to the two aforementioned channels. Finally, the obtained amino acid features are fed into the prediction layer for the final prediction.
Conclusion: Compared with six protein structure-based methods and six protein sequence-based methods, our model achieves optimal performance across evaluation metrics, including accuracy, precision, F1, Matthews correlation coefficient, and area under the precision-recall curve, which demonstrates the superiority of our model.
Citations: 0
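Of the initial per-residue features the MIPPIS abstract lists, one-hot encoding is the simplest to show. The sketch below assumes the 20 standard amino acids in alphabetical one-letter order and maps any other symbol to an all-zero row; both choices are illustrative assumptions, since the abstract does not specify them.

```python
# Assumed alphabet: the 20 standard amino acids, one-letter codes, alphabetical.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Return one 20-dimensional indicator row per residue in the sequence."""
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        if residue in INDEX:  # unknown symbols keep an all-zero row
            row[INDEX[residue]] = 1
        rows.append(row)
    return rows

encoded = one_hot_encode("ACX")
for residue, row in zip("ACX", encoded):
    print(residue, row)
```

In a multi-feature setup like the one described, rows such as these would typically be concatenated with the other per-residue descriptors before entering the network.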
CUDASW++4.0: ultra-fast GPU-based Smith-Waterman protein sequence database search.
IF 2.9 | Zone 3 (Biology)
BMC Bioinformatics | Pub Date: 2024-11-02 | DOI: 10.1186/s12859-024-05965-6
Bertil Schmidt, Felix Kallenborn, Alejandro Chacon, Christian Hundt
BMC Bioinformatics 25(1): 342 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11531700/pdf/
Background: The maximal sensitivity for local pairwise alignment makes the Smith-Waterman algorithm a popular choice for protein sequence database search. However, its quadratic time complexity makes it compute-intensive. Unfortunately, current state-of-the-art software tools are not able to leverage the massively parallel processing capabilities of modern GPUs with close-to-peak performance. This motivates the need for more efficient implementations.
Results: CUDASW++4.0 is a fast software tool for scanning protein sequence databases with the Smith-Waterman algorithm on CUDA-enabled GPUs. Our approach achieves high efficiency for dynamic programming-based alignment computation by minimizing memory accesses and instructions. We provide both efficient matrix tiling and sequence database partitioning schemes, and exploit next-generation floating point arithmetic and novel DPX instructions. This leads to close-to-peak performance on modern GPU generations (Ampere, Ada, Hopper) with throughput rates of up to 1.94 TCUPS, 5.01 TCUPS, and 5.71 TCUPS on an A100, L40S, and H100, respectively. Evaluation on the Swiss-Prot, UniRef50, and TrEMBL databases shows that CUDASW++4.0 achieves more than an order-of-magnitude performance improvement over previous GPU-based approaches (CUDASW++3.0, ADEPT, SW#DB). In addition, our algorithm demonstrates significant speedups over top-performing CPU-based tools (BLASTP, SWIPE, SWIMM2.0), can exploit multi-GPU nodes with linear scaling, and features an impressive energy efficiency of up to 15.7 GCUPS/Watt.
Conclusion: CUDASW++4.0 changes the standing of GPUs in protein sequence database search with Smith-Waterman alignment by providing close-to-peak performance on modern GPUs. It is freely available at https://github.com/asbschmidt/CUDASW4 .
Citations: 0
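For readers unfamiliar with the recurrence that CUDASW++4.0 accelerates, here is a minimal CPU sketch of Smith-Waterman local alignment scoring. It is a simplification and not the tool's implementation: it assumes a flat match/mismatch score and a linear gap penalty instead of a substitution matrix with affine gaps, and every parameter value is illustrative.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, keeping only two rows of the DP matrix."""
    prev = [0] * (len(target) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap         # gap in the target
            left = curr[j - 1] + gap   # gap in the query
            cell = max(0, diag, up, left)  # 0 restarts the local alignment
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

identical = smith_waterman_score("ACGT", "ACGT")
embedded = smith_waterman_score("GGACGTGG", "ACGT")
unrelated = smith_waterman_score("AAAA", "TTTT")
print(identical, embedded, unrelated)
```

Each inner-loop iteration is one cell update, so aligning a query of length m against a target of length n costs m x n updates. Throughput figures such as TCUPS count these updates per second (10^12 for the tera prefix), which is why the quadratic cost dominates database-scale searches.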
Improving crop production using an agro-deep learning framework in precision agriculture.
IF 2.9 | Zone 3 (Biology)
BMC Bioinformatics | Pub Date: 2024-11-01 | DOI: 10.1186/s12859-024-05970-9
J Logeshwaran, Durgesh Srivastava, K Sree Kumar, M Jenolin Rex, Amal Al-Rasheed, Masresha Getahun, Ben Othman Soufiene
BMC Bioinformatics 25(1): 341 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11529011/pdf/
Background: The study focuses on enhancing the effectiveness of precision agriculture through the application of deep learning technologies. Precision agriculture, which aims to optimize farming practices by monitoring and adjusting various factors influencing crop growth, can greatly benefit from artificial intelligence (AI) methods like deep learning. The Agro Deep Learning Framework (ADLF) was developed to tackle critical issues in crop cultivation by processing vast datasets. These datasets include variables such as soil moisture, temperature, and humidity, all of which are essential to understanding and predicting crop behavior. By leveraging deep learning models, the framework seeks to improve decision-making processes, detect potential crop problems early, and boost agricultural productivity.
Results: The study found that the Agro Deep Learning Framework (ADLF) achieved an accuracy of 85.41%, precision of 84.87%, recall of 84.24%, and an F1-Score of 88.91%, indicating strong predictive capabilities for improving crop management. The false negative rate was 91.17% and the false positive rate was 89.82%, highlighting the framework's ability to correctly detect issues while minimizing errors. These results suggest that ADLF can significantly enhance decision-making in precision agriculture, leading to improved crop yield and reduced agricultural losses.
Conclusions: The ADLF can significantly improve precision agriculture by leveraging deep learning to process complex datasets and provide valuable insights into crop management. The framework allows farmers to detect issues early, optimize resource use, and improve yields. The study demonstrates that AI-driven agriculture has the potential to revolutionize farming, making it more efficient and sustainable. Future research could focus on further refining the model and exploring its applicability across different types of crops and farming environments.
Citations: 0
BASE: a web service for providing compound-protein binding affinity prediction datasets with reduced similarity bias.
IF 2.9 | Zone 3 (Biology)
BMC Bioinformatics | Pub Date: 2024-10-30 | DOI: 10.1186/s12859-024-05968-3
Hyojin Son, Sechan Lee, Jaeuk Kim, Haangik Park, Myeong-Ha Hwang, Gwan-Su Yi
BMC Bioinformatics 25(1): 340 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11526688/pdf/
Background: Deep learning-based drug-target affinity (DTA) prediction methods have shown impressive performance, despite a high number of training parameters relative to the available data. Previous studies have highlighted the presence of dataset bias by suggesting that models trained solely on protein or ligand structures may perform similarly to those trained on complex structures. However, these studies did not propose solutions and focused solely on analyzing complex structure-based models. Even when ligands are excluded, protein-only models trained on complex structures still incorporate some ligand information at the binding sites. Therefore, it is unclear whether binding affinity can be accurately predicted using only compound or protein features due to potential dataset bias. In this study, we expanded our analysis to comprehensive databases and investigated dataset bias through compound and protein feature-based methods using multilayer perceptron models. We assessed the impact of this bias on current prediction models and proposed the binding affinity similarity explorer (BASE) web service, which provides bias-reduced datasets.
Results: By analyzing eight binding affinity databases using multilayer perceptron models, we confirmed a bias where the compound-protein binding affinity can be accurately predicted using compound features alone. This bias arises because most compounds show consistent binding affinities due to high sequence or functional similarity among their target proteins. Our Uniform Manifold Approximation and Projection analysis based on compound fingerprints further revealed that low and high variation compounds do not exhibit significant structural differences. This suggests that the primary factor driving the consistent binding affinities is protein similarity rather than compound structure. We addressed this bias by creating datasets with progressively reduced protein similarity between the training and test sets, observing significant changes in model performance. We developed the BASE web service to allow researchers to download and utilize these datasets. Feature importance analysis revealed that previous models heavily relied on protein features. However, using bias-reduced datasets increased the importance of compound and interaction features, enabling a more balanced extraction of key features.
Conclusions: We propose the BASE web service, providing both the affinity prediction results of existing models and bias-reduced datasets. These resources contribute to the development of generalized and robust predictive models, enhancing the accuracy and reliability of DTA predictions in the drug discovery process. BASE is freely available online at https://synbi2024.kaist.ac.kr/base .
Citations: 0
Shrinkage estimation of gene interaction networks in single-cell RNA sequencing data.
IF 2.9 | Zone 3 (Biology)
BMC Bioinformatics | Pub Date: 2024-10-26 | DOI: 10.1186/s12859-024-05946-9
Duong H T Vo, Thomas Thorne
BMC Bioinformatics 25(1): 339 | PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11515282/pdf/
Background: Gene interaction networks are graphs in which nodes represent genes and edges represent functional interactions between them. These interactions can be at multiple levels, for instance, gene regulation, protein-protein interaction, or metabolic pathways. To analyse gene interaction networks at a large scale, gene co-expression network analysis is often applied to high-throughput gene expression data such as RNA sequencing data. With advances in sequencing technology, expression of genes can be measured in individual cells. Single-cell RNA sequencing (scRNAseq) provides insights into cellular development, differentiation, and characteristics at the transcriptomic level. High sparsity and high-dimensional data structures pose challenges in scRNAseq data analysis.
Results: In this study, a sparse inverse covariance matrix estimation framework for scRNAseq data is developed to capture direct functional interactions between genes. Comparative analyses using simulated scRNAseq data highlight the high performance and fast computation of Stein-type shrinkage in high-dimensional data. Data transformation approaches also improve the performance of shrinkage methods on non-Gaussian distributed data. Zero-inflated modelling of scRNAseq data based on a negative binomial distribution enhances shrinkage performance in zero-inflated data without interfering with non-zero-inflated count data.
Conclusion: The proposed framework broadens the application of graphical models in scRNAseq analysis, offering flexibility with respect to the sparsity of count data resulting from dropout events, high performance, and fast computation. The framework is implemented in a reproducible Snakemake workflow, https://github.com/calathea24/ZINBGraphicalModel , and in the R package ZINBStein, https://github.com/calathea24/ZINBStein .
Citations: 0
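The idea behind shrinkage estimation can be illustrated with a generic linear shrinkage of the sample covariance toward its own diagonal. This is a sketch under stated assumptions and not the ZINBStein implementation: the shrinkage intensity `lam` is fixed by hand here, whereas Stein-type estimators derive it from the data, and the three-sample dataset is made up for the example.

```python
def sample_covariance(samples):
    """Unbiased sample covariance matrix; rows are observations, columns genes."""
    n, p = len(samples), len(samples[0])
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    return [
        [
            sum((row[i] - means[i]) * (row[j] - means[j]) for row in samples) / (n - 1)
            for j in range(p)
        ]
        for i in range(p)
    ]

def shrink_to_diagonal(cov, lam):
    """Keep the variances, scale every off-diagonal covariance by (1 - lam)."""
    p = len(cov)
    return [
        [cov[i][j] if i == j else (1 - lam) * cov[i][j] for j in range(p)]
        for i in range(p)
    ]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Hypothetical data: the second gene is exactly twice the first,
# so the sample covariance is singular and cannot be inverted.
samples = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
cov = sample_covariance(samples)
shrunk = shrink_to_diagonal(cov, lam=0.5)
print(cov, det2(cov))
print(shrunk, det2(shrunk))
```

The raw estimate has determinant zero, while the shrunk one is invertible. That regularization is what makes an inverse covariance (precision) matrix obtainable when samples are few relative to genes.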