{"title":"Toward Robust Self-Training Paradigm for Molecular Prediction Tasks.","authors":"Hehuan Ma, Feng Jiang, Yu Rong, Yuzhi Guo, Junzhou Huang","doi":"10.1089/cmb.2023.0187","DOIUrl":"10.1089/cmb.2023.0187","url":null,"abstract":"<p><p>Molecular prediction tasks normally demand a series of professional experiments to label the target molecule, which suffers from the limited labeled data problem. One of the semisupervised learning paradigms, known as self-training, utilizes both labeled and unlabeled data. Specifically, a teacher model is trained using labeled data and produces pseudo labels for unlabeled data. These labeled and pseudo-labeled data are then jointly used to train a student model. However, the pseudo labels generated from the teacher model are generally not sufficiently accurate. Thus, we propose a robust self-training strategy by exploring robust loss function to handle such noisy labels in two paradigms, that is, generic and adaptive. We have conducted experiments on three molecular biology prediction tasks with four backbone models to gradually evaluate the performance of the proposed robust self-training strategy. The results demonstrate that the proposed method enhances prediction performance across all tasks, notably within molecular regression tasks, where there has been an average enhancement of 41.5%. Furthermore, the visualization analysis confirms the superiority of our method. Our proposed robust self-training is a simple yet effective strategy that efficiently improves molecular biology prediction performance. It tackles the labeled data insufficient issue in molecular biology by taking advantage of both labeled and unlabeled data. 
Moreover, it can be easily embedded with any prediction task, which serves as a universal approach for the bioinformatics community.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"31 3","pages":"213-228"},"PeriodicalIF":1.7,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140293601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unveiling Gene Regulatory Networks That Characterize Difference of Molecular Interplays Between Gastric Cancer Drug Sensitive and Resistance Cell Lines.","authors":"Heewon Park","doi":"10.1089/cmb.2023.0215","DOIUrl":"10.1089/cmb.2023.0215","url":null,"abstract":"<p><p>Gastric cancer is a leading cause of cancer-related deaths globally and chemotherapy is widely accepted as the standard treatment for gastric cancer. However, drug resistance in cancer cells poses a significant obstacle to the success of chemotherapy, limiting its effectiveness in treating gastric cancer. Although many studies have been conducted to unravel the mechanisms of acquired drug resistance, the existing studies were based on abnormalities of a single gene, that is, differential gene expression (DGE) analysis. Single gene-based analysis alone is insufficient to comprehensively understand the mechanisms of drug resistance in cancer cells, because the underlying processes of the mechanism involve perturbations of the molecular interactions. To uncover the mechanism of acquired gastric cancer drug resistance, we identify differentially regulated gene networks between drug-sensitive and drug-resistant cell lines. We develop a computational strategy for identifying phenotype-specific gene networks by extending the existing method, CIdrgn, that quantifies the dissimilarity of gene networks based on comprehensive information of network structure, that is, regulatory effect between genes, structure of edge, and expression levels of genes. To enhance the efficiency of identifying differentially regulated gene networks and improve the biological relevance of our findings, we integrate additional information and incorporate knowledge of network biology, such as hubness of genes and weighted adjacency matrices. The outstanding capabilities of the developed strategy are validated through Monte Carlo simulations. 
By using our strategy, we uncover gene regulatory networks that specifically capture the molecular interplays distinguishing drug-sensitive and drug-resistant profiles in gastric cancer. The reliability and significance of the identified drug-sensitive and resistance-specific gene networks, as well as their related markers, are verified through literature. Our analysis for differentially regulated gene network identification has the capacity to characterize the drug-sensitive and resistance-specific molecular interplays related to mechanisms of acquired drug resistance that cannot be revealed by analysis based solely on abnormalities of a single gene, for example, DGE analysis. Through our analysis and comprehensive examination of relevant literature, we suggest that targeting the suppressors of the identified drug-resistant markers, such as the Melanoma Antigen (<i>MAGE</i>) family, Trefoil Factor (<i>TFF</i>) family, and Ras-Associated Binding 25 (<i>RAB25</i>), while enhancing the expression of inducers of the drug sensitivity markers [e.g., Serum Amyloid A (<i>SAA</i>) family], could potentially reduce drug resistance and enhance the effectiveness of chemotherapy for gastric cancer. We expect that the developed strategy will serve as a useful tool for uncovering cancer-related phenotype-specific gene regulatory ","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"257-274"},"PeriodicalIF":1.7,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139939978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finding Highly Similar Regions of Genomic Sequences Through Homomorphic Encryption.","authors":"Magsarjav Bataa, Siwoo Song, Kunsoo Park, Miran Kim, Jung Hee Cheon, Sun Kim","doi":"10.1089/cmb.2023.0050","DOIUrl":"10.1089/cmb.2023.0050","url":null,"abstract":"<p><p>Finding highly similar regions of genomic sequences is a basic computation of genomic analysis. Genomic analyses on a large amount of data are efficiently processed in cloud environments, but outsourcing them to a cloud raises concerns over the privacy and security issues. Homomorphic encryption (HE) is a powerful cryptographic primitive that preserves privacy of genomic data in various analyses processed in an untrusted cloud environment. We introduce an efficient algorithm for finding highly similar regions of two homomorphically encrypted sequences, and describe how to implement it using the bit-wise and word-wise HE schemes. In the experiment, our algorithm outperforms an existing algorithm by up to two orders of magnitude in terms of elapsed time. Overall, it finds highly similar regions of the sequences in real data sets in a feasible time.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"31 3","pages":"197-212"},"PeriodicalIF":1.7,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140293600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GS-TCGA: Gene Set-Based Analysis of The Cancer Genome Atlas.","authors":"Tarrion Baird, Rahul Roychoudhuri","doi":"10.1089/cmb.2023.0278","DOIUrl":"10.1089/cmb.2023.0278","url":null,"abstract":"<p><p>Most tools for analyzing large gene expression datasets, including The Cancer Genome Atlas (TCGA), have focused on analyzing the expression of individual genes or inference of the abundance of specific cell types from whole transcriptome information. While these methods provide useful insights, they can overlook crucial process-based information that may enhance our understanding of cancer biology. In this study, we describe three novel tools incorporated into an online resource; gene set-based analysis of The Cancer Genome Atlas (GS-TCGA). GS-TCGA is designed to enable user-friendly exploration of TCGA data using gene set-based analysis, leveraging gene sets from the Molecular Signatures Database. GS-TCGA includes three unique tools: GS-Surv determines the association between the expression of gene sets and survival in human cancers. Co-correlative gene set enrichment analysis (CC-GSEA) utilizes interpatient heterogeneity in cancer gene expression to infer functions of specific genes based on GSEA of coregulated genes in TCGA. GS-Corr utilizes interpatient heterogeneity in cancer gene expression profiles to identify genes coregulated with the expression of specific gene sets in TCGA. Users are also able to upload custom gene sets for analysis with each tool. 
These tools empower researchers to perform survival analysis linked to gene set expression, explore the functional implications of gene coexpression, and identify potential gene regulatory mechanisms.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"229-240"},"PeriodicalIF":1.7,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140021922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DERNA Enables Pareto Optimal RNA Design.","authors":"Xinyu Gu, Yuanyuan Qi, Mohammed El-Kebir","doi":"10.1089/cmb.2023.0283","DOIUrl":"10.1089/cmb.2023.0283","url":null,"abstract":"<p><p>The design of an RNA sequence <math><mstyle><mi>v</mi></mstyle></math> that encodes an input target protein sequence <math><mstyle><mi>w</mi></mstyle></math> is a crucial aspect of messenger RNA (mRNA) vaccine development. There are an exponential number of possible RNA sequences for a single target protein due to codon degeneracy. These potential RNA sequences can assume various secondary structure conformations, each with distinct minimum free energy (MFE), impacting thermodynamic stability and mRNA half-life. Furthermore, the presence of species-specific codon usage bias, quantified by the codon adaptation index (CAI), plays a vital role in translation efficiency. While earlier studies focused on optimizing either MFE or CAI, recent research has underscored the advantages of simultaneously optimizing both objectives. However, optimizing one objective comes at the expense of the other. In this work, we present the Pareto Optimal RNA Design problem, aiming to identify the set of Pareto optimal solutions for which no alternative solutions exist that exhibit better MFE and CAI values. Our algorithm DEsign RNA (DERNA) uses the weighted sum method to enumerate the Pareto front by optimizing convex combinations of both objectives. We use dynamic programming to solve each convex combination in <math><mstyle><mi>O</mi></mstyle><mrow><mo>(</mo><mrow><mo>|</mo><mstyle><mi>w</mi></mstyle><msup><mrow><mo>|</mo></mrow><mrow><mn>3</mn></mrow></msup></mrow><mo>)</mo></mrow></math> time and <math><mstyle><mi>O</mi></mstyle><mrow><mo>(</mo><mrow><mo>|</mo><mstyle><mi>w</mi></mstyle><msup><mrow><mo>|</mo></mrow><mrow><mn>2</mn></mrow></msup></mrow><mo>)</mo></mrow></math> space. 
Compared with CDSfold, a previous approach that only optimizes MFE, we show on a benchmark data set that DERNA obtains solutions with identical MFE but superior CAI. Moreover, we show that DERNA matches the performance in terms of solution quality of LinearDesign, a recent approach that similarly seeks to balance MFE and CAI. We conclude by demonstrating our method's potential for mRNA vaccine design for the SARS-CoV-2 spike protein.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"179-196"},"PeriodicalIF":1.7,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139990194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prediction of MicroRNA-Disease Potential Association Based on Sparse Learning and Multilayer Random Walks.","authors":"Hai-Bin Yao, Zhen-Jie Hou, Wen-Guang Zhang, Han Li, Yan Chen","doi":"10.1089/cmb.2023.0266","DOIUrl":"10.1089/cmb.2023.0266","url":null,"abstract":"<p><p>More and more studies have shown that microRNAs (miRNAs) play an indispensable role in the study of complex diseases in humans. Traditional biological experiments to detect miRNA-disease associations are expensive and time-consuming. Therefore, it is necessary to propose efficient and meaningful computational models to predict miRNA-disease associations. In this study, we aim to propose a miRNA-disease association prediction model based on sparse learning and multilayer random walks (SLMRWMDA). The miRNA-disease association matrix is decomposed and reconstructed by the sparse learning method to obtain richer association information, and at the same time, the initial probability matrix for the random walk with restart algorithm is obtained. The disease similarity network, miRNA similarity network, and miRNA-disease association network are used to construct heterogeneous networks, and the stable probability is obtained based on the topological structure features of diseases and miRNAs through a multilayer random walk algorithm to predict miRNA-disease potential association. The experimental results show that the prediction accuracy of this model is significantly improved compared with the previous related models. We evaluated the model using global leave-one-out cross-validation (global LOOCV) and fivefold cross-validation (5-fold CV). The area under the curve (AUC) value for the LOOCV is 0.9368. The mean AUC value for 5-fold CV is 0.9335 and the variance is 0.0004. 
In the case study, the results show that SLMRWMDA is effective in inferring the potential association of miRNA-disease.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"241-256"},"PeriodicalIF":1.7,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139912747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iGly-IDN: Identifying Lysine Glycation Sites in Proteins Based on Improved DenseNet.","authors":"Jianhua Jia, Genqiang Wu, Meifang Li","doi":"10.1089/cmb.2023.0112","DOIUrl":"10.1089/cmb.2023.0112","url":null,"abstract":"<p><p>Lysine glycation is one of the most significant protein post-translational modifications, which changes the properties of the proteins and causes them to be dysfunctional. Accurately identifying glycation sites helps to understand the biological function and potential mechanism of glycation in disease treatments. Nonetheless, the experimental methods are ordinarily inefficient and costly, so effective computational methods need to be developed. In this study, we proposed a new model called iGly-IDN based on the improved densely connected convolutional networks (DenseNet). First, one-hot encoding was adopted to obtain the original feature maps. Afterward, the improved DenseNet was adopted to capture feature information with the importance degrees during the feature learning. According to the experimental results, Acc reaches 66%, and Matthews correlation coefficient reaches 0.33 on the independent testing data set, which indicates that the iGly-IDN can provide more effective glycation site identification than the current predictors.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"161-174"},"PeriodicalIF":1.7,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138451634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing the Performance of Three Computational Methods for Estimating the Effective Reproduction Number.","authors":"Zihan Wang, Mengxia Xu, Zonglin Yang, Yu Jin, Yong Zhang","doi":"10.1089/cmb.2023.0065","DOIUrl":"10.1089/cmb.2023.0065","url":null,"abstract":"<p><p>The effective reproduction number <math><mrow><mo>(</mo><mrow><msub><mrow><mi>R</mi></mrow><mrow><mi>t</mi></mrow></msub></mrow><mo>)</mo></mrow></math> is one of the most important epidemiological parameters, providing suggestions for monitoring the development trend of diseases and also for adjusting the prevention and control policies. However, a few studies have focused on the performance of some common computational methods for <i>R<sub>t</sub></i>. The purpose of this article is to compare the performance of three computational methods for <i>R<sub>t</sub></i>: the time-dependent (TD) method, the new time-varying (NT) method, and the sequential Bayesian (SB) method. Four evaluation methods-accuracy, correlation coefficient, similarity based on trend, and dynamic time warping distance-were used to compare the effectiveness of three computational methods for <i>R<sub>t</sub></i> under different time lags and time windows. The results showed that the NT method was a better choice for real-time monitoring and analysis of the epidemic in the middle and late stages of the infectious disease. The TD method could reflect the change of the number of cases stably and accurately, and was more suitable for monitoring the change of <i>R<sub>t</sub></i> during the whole process of the epidemic outbreak. When the data were relatively stable, the SB method could also provide a reliable estimate for <i>R<sub>t</sub></i>, while the error would increase when the fluctuation in the number of cases increased. 
The results would provide suggestions for selecting appropriate <i>R<sub>t</sub></i> estimation methods and making policy adjustments more timely and effectively according to the change of <i>R<sub>t</sub></i>.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"128-146"},"PeriodicalIF":1.7,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139472540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Privacy-Preserving Identification of Cancer Subtype-Specific Driver Genes Based on Multigenomics Data with Privatedriver.","authors":"Junrong Song, Zhiming Song, Jinpeng Zhang, Yuanli Gong","doi":"10.1089/cmb.2023.0115","DOIUrl":"10.1089/cmb.2023.0115","url":null,"abstract":"<p><p>Identifying cancer subtype-specific driver genes from a large number of irrelevant passengers is crucial for targeted therapy in cancer treatment. Recently, the rapid accumulation of large-scale cancer genomics data from multiple institutions has presented remarkable opportunities for identification of cancer subtype-specific driver genes. However, insufficient subtype samples, privacy issues, and the heterogeneity of aberration events pose great challenges in precisely identifying cancer subtype-specific driver genes. To address this, we introduce privatedriver, the first model for identifying subtype-specific driver genes that integrates genomics data from multiple institutions in a data privacy-preserving collaboration manner. The process of identifying subtype-specific cancer driver genes using privatedriver involves the following two steps: genomics data integration and collaborative training. In the integration process, the aberration events from multiple genomics data sources are combined for each institution using the forward and backward propagation method of NetICS. In the collaborative training process, each institution utilizes the federated learning framework to upload encrypted model parameters instead of raw data of all institutions to train a global model by using the non-negative matrix factorization algorithm. We applied privatedriver on head and neck squamous cell and colon cancer from The Cancer Genome Atlas website and evaluated it with two benchmarks using macro-Fscore. 
The comparison analysis demonstrates that privatedriver achieves comparable results to centralized learning models and outperforms most other nonprivacy preserving models, all while ensuring the confidentiality of patient information. We also demonstrate that, for varying predicted driver gene distributions in subtype, our model fully considers the heterogeneity of subtype and identifies subtype-specific driver genes corresponding to the given prognosis and therapeutic effect. The success of privatedriver reveals the feasibility and effectiveness of identifying cancer subtype-specific driver genes in a data protection manner, providing new insights for future privacy-preserving driver gene identification studies.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"99-116"},"PeriodicalIF":1.7,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139564179","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepPPThermo: A Deep Learning Framework for Predicting Protein Thermostability Combining Protein-Level and Amino Acid-Level Features.","authors":"Xiaoyang Xiang, Jiaxuan Gao, Yanrui Ding","doi":"10.1089/cmb.2023.0097","DOIUrl":"10.1089/cmb.2023.0097","url":null,"abstract":"<p><p>Using wet experimental methods to discover new thermophilic proteins or improve protein thermostability is time-consuming and expensive. Machine learning methods have shown powerful performance in the study of protein thermostability in recent years. However, how to make full use of multiview sequence information to predict thermostability effectively is still a challenge. In this study, we proposed a deep learning-based classifier named DeepPPThermo that fuses features of classical sequence features and deep learning representation features for classifying thermophilic and mesophilic proteins. In this model, deep neural network (DNN) and bi-long short-term memory (Bi-LSTM) are used to mine hidden features. Furthermore, local attention and global attention mechanisms give different importance to multiview features. The fused features are fed to a fully connected network classifier to distinguish thermophilic and mesophilic proteins. Our model is comprehensively compared with advanced machine learning algorithms and deep learning algorithms, proving that our model performs better. We further compare the effects of removing different features on the classification results, demonstrating the importance of each feature and the robustness of the model. 
Our DeepPPThermo model can be further used to explore protein diversity, identify new thermophilic proteins, and guide directed mutations of mesophilic proteins.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"147-160"},"PeriodicalIF":1.7,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138805058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}