{"title":"The <i>k</i>-Robinson-Foulds Dissimilarity Measures for Comparison of Labeled Trees.","authors":"Elahe Khayatian, Gabriel Valiente, Louxin Zhang","doi":"10.1089/cmb.2023.0312","DOIUrl":"10.1089/cmb.2023.0312","url":null,"abstract":"<p><p><b>Understanding the mutational history of tumor cells is a critical endeavor in unraveling the mechanisms that drive the onset and progression of cancer. Modeling tumor cell evolution with labeled trees motivates researchers to develop different measures to compare labeled trees. Although the Robinson-Foulds (RF) distance is widely used for comparing species trees, its applicability to labeled trees reveals certain limitations. This study introduces the <i>k</i>-RF dissimilarity measures, tailored to address the challenges of labeled tree comparison. The RF distance is succinctly expressed as <i>n</i>-RF in the space of labeled trees with <i>n</i> nodes. Like the RF distance, the <i>k</i>-RF is a pseudometric for multiset-labeled trees and becomes a metric in the space of 1-labeled trees. By setting <i>k</i> to a small value, the <i>k</i>-RF dissimilarity can capture analogous local regions in two labeled trees with different size or different labels.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"328-344"},"PeriodicalIF":1.7,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11057537/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139564180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Orthology and Paralogy Relationships at Transcript Level.","authors":"Wend Yam D D Ouedraogo, Aida Ouangraoua","doi":"10.1089/cmb.2023.0400","DOIUrl":"10.1089/cmb.2023.0400","url":null,"abstract":"<p><p><b>Eukaryotic genes undergo a mechanism called alternative processing, resulting in transcriptome diversity by allowing the production of multiple distinct transcripts from a gene. More than half of human genes are affected, and the resulting transcripts are highly conserved among orthologous genes of distinct species. In this work, we present the definition of orthology and paralogy between transcripts of homologous genes, together with an algorithm to compute clusters of conserved orthologous and paralogous transcripts. Gene-level homology relationships are utilized to define various types of homology relationships between transcripts originating from the same ancestral transcript. A Reciprocal Best Hits approach is employed to infer clusters of isoorthologous and recent paralogous transcripts. We applied this method to transcripts from simulated gene families as well as real gene families from the Ensembl-Compara database. The results are consistent with those from previous studies that compared orthologous gene transcripts. 
Furthermore, our findings provide evidence that searching for conserved transcripts between homologous genes, beyond the scope of orthologous genes, is likely to yield valuable information.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"31 4","pages":"277-293"},"PeriodicalIF":1.7,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140861411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Robust Self-Training Paradigm for Molecular Prediction Tasks.","authors":"Hehuan Ma, Feng Jiang, Yu Rong, Yuzhi Guo, Junzhou Huang","doi":"10.1089/cmb.2023.0187","DOIUrl":"10.1089/cmb.2023.0187","url":null,"abstract":"<p><p>Molecular prediction tasks normally demand a series of professional experiments to label the target molecule, which suffers from the limited labeled data problem. One of the semisupervised learning paradigms, known as self-training, utilizes both labeled and unlabeled data. Specifically, a teacher model is trained using labeled data and produces pseudo labels for unlabeled data. These labeled and pseudo-labeled data are then jointly used to train a student model. However, the pseudo labels generated from the teacher model are generally not sufficiently accurate. Thus, we propose a robust self-training strategy by exploring robust loss function to handle such noisy labels in two paradigms, that is, generic and adaptive. We have conducted experiments on three molecular biology prediction tasks with four backbone models to gradually evaluate the performance of the proposed robust self-training strategy. The results demonstrate that the proposed method enhances prediction performance across all tasks, notably within molecular regression tasks, where there has been an average enhancement of 41.5%. Furthermore, the visualization analysis confirms the superiority of our method. Our proposed robust self-training is a simple yet effective strategy that efficiently improves molecular biology prediction performance. It tackles the labeled data insufficient issue in molecular biology by taking advantage of both labeled and unlabeled data. 
Moreover, it can be easily embedded with any prediction task, which serves as a universal approach for the bioinformatics community.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"31 3","pages":"213-228"},"PeriodicalIF":1.7,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140293601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unveiling Gene Regulatory Networks That Characterize Difference of Molecular Interplays Between Gastric Cancer Drug Sensitive and Resistance Cell Lines.","authors":"Heewon Park","doi":"10.1089/cmb.2023.0215","DOIUrl":"10.1089/cmb.2023.0215","url":null,"abstract":"<p><p>Gastric cancer is a leading cause of cancer-related deaths globally and chemotherapy is widely accepted as the standard treatment for gastric cancer. However, drug resistance in cancer cells poses a significant obstacle to the success of chemotherapy, limiting its effectiveness in treating gastric cancer. Although many studies have been conducted to unravel the mechanisms of acquired drug resistance, the existing studies were based on abnormalities of a single gene, that is, differential gene expression (DGE) analysis. Single gene-based analysis alone is insufficient to comprehensively understand the mechanisms of drug resistance in cancer cells, because the underlying processes of the mechanism involve perturbations of the molecular interactions. To uncover the mechanism of acquired gastric cancer drug resistance, we identify differentially regulated gene networks between drug-sensitive and drug-resistant cell lines. We develop a computational strategy for identifying phenotype-specific gene networks by extending the existing method, CIdrgn, that quantifies the dissimilarity of gene networks based on comprehensive information of network structure, that is, regulatory effect between genes, structure of edge, and expression levels of genes. To enhance the efficiency of identifying differentially regulated gene networks and improve the biological relevance of our findings, we integrate additional information and incorporate knowledge of network biology, such as hubness of genes and weighted adjacency matrices. The outstanding capabilities of the developed strategy are validated through Monte Carlo simulations. 
By using our strategy, we uncover gene regulatory networks that specifically capture the molecular interplays distinguishing drug-sensitive and drug-resistant profiles in gastric cancer. The reliability and significance of the identified drug-sensitive and resistance-specific gene networks, as well as their related markers, are verified through literature. Our analysis for differentially regulated gene network identification has the capacity to characterize the drug-sensitive and resistance-specific molecular interplays related to mechanisms of acquired drug resistance that cannot be revealed by analysis based solely on abnormalities of a single gene, for example, DGE analysis. Through our analysis and comprehensive examination of relevant literature, we suggest that targeting the suppressors of the identified drug-resistant markers, such as the Melanoma Antigen (<i>MAGE</i>) family, Trefoil Factor (<i>TFF</i>) family, and Ras-Associated Binding 25 (<i>RAB25</i>), while enhancing the expression of inducers of the drug sensitivity markers [e.g., Serum Amyloid A (<i>SAA</i>) family], could potentially reduce drug resistance and enhance the effectiveness of chemotherapy for gastric cancer. We expect that the developed strategy will serve as a useful tool for uncovering cancer-related phenotype-specific gene regulatory ","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"257-274"},"PeriodicalIF":1.7,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139939978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Finding Highly Similar Regions of Genomic Sequences Through Homomorphic Encryption.","authors":"Magsarjav Bataa, Siwoo Song, Kunsoo Park, Miran Kim, Jung Hee Cheon, Sun Kim","doi":"10.1089/cmb.2023.0050","DOIUrl":"10.1089/cmb.2023.0050","url":null,"abstract":"<p><p>Finding highly similar regions of genomic sequences is a basic computation of genomic analysis. Genomic analyses on a large amount of data are efficiently processed in cloud environments, but outsourcing them to a cloud raises concerns over the privacy and security issues. Homomorphic encryption (HE) is a powerful cryptographic primitive that preserves privacy of genomic data in various analyses processed in an untrusted cloud environment. We introduce an efficient algorithm for finding highly similar regions of two homomorphically encrypted sequences, and describe how to implement it using the bit-wise and word-wise HE schemes. In the experiment, our algorithm outperforms an existing algorithm by up to two orders of magnitude in terms of elapsed time. Overall, it finds highly similar regions of the sequences in real data sets in a feasible time.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"31 3","pages":"197-212"},"PeriodicalIF":1.7,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140293600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GS-TCGA: Gene Set-Based Analysis of The Cancer Genome Atlas.","authors":"Tarrion Baird, Rahul Roychoudhuri","doi":"10.1089/cmb.2023.0278","DOIUrl":"10.1089/cmb.2023.0278","url":null,"abstract":"<p><p>Most tools for analyzing large gene expression datasets, including The Cancer Genome Atlas (TCGA), have focused on analyzing the expression of individual genes or inference of the abundance of specific cell types from whole transcriptome information. While these methods provide useful insights, they can overlook crucial process-based information that may enhance our understanding of cancer biology. In this study, we describe three novel tools incorporated into an online resource: gene set-based analysis of The Cancer Genome Atlas (GS-TCGA). GS-TCGA is designed to enable user-friendly exploration of TCGA data using gene set-based analysis, leveraging gene sets from the Molecular Signatures Database. GS-TCGA includes three unique tools: GS-Surv determines the association between the expression of gene sets and survival in human cancers. Co-correlative gene set enrichment analysis (CC-GSEA) utilizes interpatient heterogeneity in cancer gene expression to infer functions of specific genes based on GSEA of coregulated genes in TCGA. GS-Corr utilizes interpatient heterogeneity in cancer gene expression profiles to identify genes coregulated with the expression of specific gene sets in TCGA. Users are also able to upload custom gene sets for analysis with each tool. 
These tools empower researchers to perform survival analysis linked to gene set expression, explore the functional implications of gene coexpression, and identify potential gene regulatory mechanisms.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"229-240"},"PeriodicalIF":1.7,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140021922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DERNA Enables Pareto Optimal RNA Design.","authors":"Xinyu Gu, Yuanyuan Qi, Mohammed El-Kebir","doi":"10.1089/cmb.2023.0283","DOIUrl":"10.1089/cmb.2023.0283","url":null,"abstract":"<p><p>The design of an RNA sequence <math><mstyle><mi>v</mi></mstyle></math> that encodes an input target protein sequence <math><mstyle><mi>w</mi></mstyle></math> is a crucial aspect of messenger RNA (mRNA) vaccine development. There are an exponential number of possible RNA sequences for a single target protein due to codon degeneracy. These potential RNA sequences can assume various secondary structure conformations, each with distinct minimum free energy (MFE), impacting thermodynamic stability and mRNA half-life. Furthermore, the presence of species-specific codon usage bias, quantified by the codon adaptation index (CAI), plays a vital role in translation efficiency. While earlier studies focused on optimizing either MFE or CAI, recent research has underscored the advantages of simultaneously optimizing both objectives. However, optimizing one objective comes at the expense of the other. In this work, we present the Pareto Optimal RNA Design problem, aiming to identify the set of Pareto optimal solutions for which no alternative solutions exist that exhibit better MFE and CAI values. Our algorithm DEsign RNA (DERNA) uses the weighted sum method to enumerate the Pareto front by optimizing convex combinations of both objectives. We use dynamic programming to solve each convex combination in <math><mstyle><mi>O</mi></mstyle><mrow><mo>(</mo><mrow><mo>|</mo><mstyle><mi>w</mi></mstyle><msup><mrow><mo>|</mo></mrow><mrow><mn>3</mn></mrow></msup></mrow><mo>)</mo></mrow></math> time and <math><mstyle><mi>O</mi></mstyle><mrow><mo>(</mo><mrow><mo>|</mo><mstyle><mi>w</mi></mstyle><msup><mrow><mo>|</mo></mrow><mrow><mn>2</mn></mrow></msup></mrow><mo>)</mo></mrow></math> space. 
Compared with CDSfold, a previous approach that only optimizes MFE, we show on a benchmark data set that DERNA obtains solutions with identical MFE but superior CAI. Moreover, we show that DERNA matches the performance in terms of solution quality of LinearDesign, a recent approach that similarly seeks to balance MFE and CAI. We conclude by demonstrating our method's potential for mRNA vaccine design for the SARS-CoV-2 spike protein.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"179-196"},"PeriodicalIF":1.7,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139990194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prediction of MicroRNA-Disease Potential Association Based on Sparse Learning and Multilayer Random Walks.","authors":"Hai-Bin Yao, Zhen-Jie Hou, Wen-Guang Zhang, Han Li, Yan Chen","doi":"10.1089/cmb.2023.0266","DOIUrl":"10.1089/cmb.2023.0266","url":null,"abstract":"<p><p>More and more studies have shown that microRNAs (miRNAs) play an indispensable role in the study of complex diseases in humans. Traditional biological experiments to detect miRNA-disease associations are expensive and time-consuming. Therefore, it is necessary to propose efficient and meaningful computational models to predict miRNA-disease associations. In this study, we aim to propose a miRNA-disease association prediction model based on sparse learning and multilayer random walks (SLMRWMDA). The miRNA-disease association matrix is decomposed and reconstructed by the sparse learning method to obtain richer association information, and at the same time, the initial probability matrix for the random walk with restart algorithm is obtained. The disease similarity network, miRNA similarity network, and miRNA-disease association network are used to construct heterogeneous networks, and the stable probability is obtained based on the topological structure features of diseases and miRNAs through a multilayer random walk algorithm to predict miRNA-disease potential association. The experimental results show that the prediction accuracy of this model is significantly improved compared with the previous related models. We evaluated the model using global leave-one-out cross-validation (global LOOCV) and fivefold cross-validation (5-fold CV). The area under the curve (AUC) value for the LOOCV is 0.9368. The mean AUC value for 5-fold CV is 0.9335 and the variance is 0.0004. 
In the case study, the results show that SLMRWMDA is effective in inferring the potential association of miRNA-disease.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"241-256"},"PeriodicalIF":1.7,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139912747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iGly-IDN: Identifying Lysine Glycation Sites in Proteins Based on Improved DenseNet.","authors":"Jianhua Jia, Genqiang Wu, Meifang Li","doi":"10.1089/cmb.2023.0112","DOIUrl":"10.1089/cmb.2023.0112","url":null,"abstract":"<p><p>Lysine glycation is one of the most significant protein post-translational modifications, which changes the properties of the proteins and causes them to be dysfunctional. Accurately identifying glycation sites helps to understand the biological function and potential mechanism of glycation in disease treatments. Nonetheless, the experimental methods are ordinarily inefficient and costly, so effective computational methods need to be developed. In this study, we proposed the new model called iGly-IDN based on the improved densely connected convolutional networks (DenseNet). First, one-hot encoding was adopted to obtain the original feature maps. Afterward, the improved DenseNet was adopted to capture feature information with the importance degrees during the feature learning. According to the experimental results, Acc reaches 66%, and Matthews correlation coefficient reaches 0.33 on the independent testing data set, which indicates that the iGly-IDN can provide more effective glycation site identification than the current predictors.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"161-174"},"PeriodicalIF":1.7,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138451634","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing the Performance of Three Computational Methods for Estimating the Effective Reproduction Number.","authors":"Zihan Wang, Mengxia Xu, Zonglin Yang, Yu Jin, Yong Zhang","doi":"10.1089/cmb.2023.0065","DOIUrl":"10.1089/cmb.2023.0065","url":null,"abstract":"<p><p>The effective reproduction number <math><mrow><mo>(</mo><mrow><msub><mrow><mi>R</mi></mrow><mrow><mi>t</mi></mrow></msub></mrow><mo>)</mo></mrow></math> is one of the most important epidemiological parameters, providing suggestions for monitoring the development trend of diseases and also for adjusting the prevention and control policies. However, few studies have focused on the performance of some common computational methods for <i>R<sub>t</sub></i>. The purpose of this article is to compare the performance of three computational methods for <i>R<sub>t</sub></i>: the time-dependent (TD) method, the new time-varying (NT) method, and the sequential Bayesian (SB) method. Four evaluation methods (accuracy, correlation coefficient, similarity based on trend, and dynamic time warping distance) were used to compare the effectiveness of three computational methods for <i>R<sub>t</sub></i> under different time lags and time windows. The results showed that the NT method was a better choice for real-time monitoring and analysis of the epidemic in the middle and late stages of the infectious disease. The TD method could reflect the change of the number of cases stably and accurately, and was more suitable for monitoring the change of <i>R<sub>t</sub></i> during the whole process of the epidemic outbreak. When the data were relatively stable, the SB method could also provide a reliable estimate for <i>R<sub>t</sub></i>, while the error would increase when the fluctuation in the number of cases increased. 
The results would provide suggestions for selecting appropriate <i>R<sub>t</sub></i> estimation methods and making policy adjustments more timely and effectively according to the change of <i>R<sub>t</sub></i>.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"128-146"},"PeriodicalIF":1.7,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139472540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}