Martí Cortada Garcia, Adrià Diéguez Moscardó, Marta Casanellas
{"title":"Generating Heterogeneous Data on Gene Trees.","authors":"Martí Cortada Garcia, Adrià Diéguez Moscardó, Marta Casanellas","doi":"10.1089/cmb.2024.0843","DOIUrl":"https://doi.org/10.1089/cmb.2024.0843","url":null,"abstract":"<p><p>We introduce GenPhylo, a Python module that simulates nucleotide sequence data along a phylogeny while avoiding the restriction to continuous-time Markov processes. GenPhylo directly uses a general Markov model and therefore naturally incorporates heterogeneity across lineages. We solve the challenge of generating transition matrices with a prescribed expected number of substitutions (the branch length information) by providing an algorithm that can be incorporated into other simulation software.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143972956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effloc: An Efficient Locating Algorithm for Mass-Occurrence Biological Patterns with FM-Index.","authors":"Li-Lu Guo","doi":"10.1089/cmb.2024.0925","DOIUrl":"https://doi.org/10.1089/cmb.2024.0925","url":null,"abstract":"<p><p>Pattern locating is a crucial step in various biological sequence analysis tasks. As a compressed full-text indexing technology, full-text minute-space index has been introduced for biological pattern locating over ultra-long genomes, with a low memory footprint and retrieving time independent of genome size. However, its locating time is limited by the number of occurrences of the biological pattern in the genome, and it is not efficient enough when dealing with mass-occurrence biological patterns. To solve this problem, we propose an efficient locating algorithm for mass-occurrence biological patterns in genomic sequence, namely Effloc. It is developed on two optimization techniques. One is that rankings with the same Burrows-Wheeler Transform character are organized into a group and calculated together, thereby reducing the number of last-to-first column (<i>LF</i>) mapping operations required to jump forward to find suffix array (SA) sampling points; the other is to design a specific structure to record the jump status, thus avoiding the redundant <i>LF</i> mapping operations that exist in the process of finding SA sampling points for those adjacent patterns that share the same sampling point. Compared with the existing algorithm, Effloc can significantly reduce the number of time-consuming <i>LF</i> mapping operations in mass-occurrence pattern locating. Ablation experiments verified our algorithm's effectiveness, exhibiting faster locating speed compared with five state-of-the-art competing algorithms. 
The source code and data are released at https://github.com/Lilu-guo/Effloc.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144009455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ida Egendal, Rasmus Froberg Brøndum, Marta Pelizzola, Asger Hobolth, Martin Bøgsted
{"title":"On the Relation Between Linear Autoencoders and Non-Negative Matrix Factorization for Mutational Signature Extraction.","authors":"Ida Egendal, Rasmus Froberg Brøndum, Marta Pelizzola, Asger Hobolth, Martin Bøgsted","doi":"10.1089/cmb.2024.0784","DOIUrl":"10.1089/cmb.2024.0784","url":null,"abstract":"<p><p>Since its introduction, non-negative matrix factorization (NMF) has been a popular tool for extracting interpretable, low-dimensional representations of high-dimensional data. However, several recent studies have proposed replacing NMF with autoencoders. The increasing popularity of autoencoders warrants an investigation into whether this replacement is in general valid and reasonable. Moreover, the exact relationship between non-negative autoencoders and NMF has not been thoroughly explored. Thus, a main aim of this study is to investigate in detail the relationship between autoencoders and NMF. We define a non-negative linear autoencoder, AE-NMF, which is mathematically equivalent to convex NMF, a constrained version of NMF. The performance of NMF and the non-negative linear autoencoder is compared within the context of mutational signature extraction from simulated and real-world cancer genomics data. We find that the reconstructions based on NMF are more accurate than those based on AE-NMF, while the signatures extracted using both methods exhibit comparable consistency and performance when externally validated. These findings suggest that AE-NMF, the linear non-negative autoencoder investigated in this article, does not provide an improvement over NMF in the field of mutational signature extraction. 
Our study serves as a foundation for understanding the theoretical implication of replacing NMF with non-negative autoencoders.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"461-472"},"PeriodicalIF":1.4,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143669971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compressed Representation of Extreme Learning Machine with Self-Diffusion Graph Denoising Applied for Dissecting Molecular Heterogeneity.","authors":"Xin Duan, Xinnan Ding, Yuelin Lu","doi":"10.1089/cmb.2024.0729","DOIUrl":"10.1089/cmb.2024.0729","url":null,"abstract":"<p><p>Molecular heterogeneity exists in many biological systems, such as major malignancies or diverse cell populations. Clustering of gene expression profiles has been widely used to dissect molecular heterogeneity. One drawback common to most clustering methods is that they often suffer from high dimensionality and noise, as well as feature redundancy. To address these challenges, we propose Extreme learning machine self-diffusion (ELMSD), an auto-encoder extreme learning machine feature representation method that incorporates a self-diffusion graph denoising framework to effectively dissect molecular heterogeneity. Our method, ELMSD, first learns a compressed representation of gene expression profiles from the hidden layer of the autoencoder extreme learning machine, followed by an iterative graph diffusion process to enhance the sample-to-sample similarity. The enhanced graph can largely facilitate the downstream clustering analysis, making it more efficient to analyze molecular properties. To demonstrate the utility of ELMSD, we applied it to one simulation dataset, five single-cell datasets, and 20 cancer datasets. Experimental results show that the ELMSD approach outperforms several state-of-the-art clustering methods, and that the cancer subtypes and cell types identified by ELMSD show strong clinical relevance and biological interpretability. 
The ELMSD code is available at: https://github.com/DXCODEE/ELMSD.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"486-497"},"PeriodicalIF":1.4,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143657382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Dynamics of HIV-Tuberculosis Coinfection Model with Temporal Recovery from Tuberculosis: An Analysis.","authors":"Pankaj Singh Rana, Nitin Sharma, Sunil Singh Negi, Haci Mehmet Baskonus","doi":"10.1089/cmb.2024.0763","DOIUrl":"https://doi.org/10.1089/cmb.2024.0763","url":null,"abstract":"<p><p>The current study is an attempt to frame a deterministic compartmental model for HIV-TB coinfection, considering temporary recovery from Tuberculosis (TB) after treatment (the possibility of reinfection with TB even after recovery). The proposed HIV-TB coinfection model is a composite of a susceptible-infected (SI) type HIV/AIDS model and a susceptible-exposed-infected-recovered type TB model. First, the HIV-TB model is constructed, followed by a qualitative investigation of the model. The equilibrium points of the model are obtained and examined in detail. Further, the basic reproduction number for the HIV-TB coinfection model has been computed, and the proposed model has been simulated numerically to investigate the effect of treatment on HIV-TB coinfection. Analysis of the model shows that an interior equilibrium exists when both the HIV and TB reproduction numbers are greater than unity. The results show that TB treatment is most efficient in eliminating HIV-TB coinfection whenever the basic reproduction number of HIV-TB is less than one. In addition, our results suggest that the reinfection of TB after recovery impacts HIV-TB transmission. It has been found that reinfection makes disease eradication more challenging. 
In the presence of reinfection, the total number of infected cases is always higher than in the absence of reinfection.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"32 5","pages":"537-555"},"PeriodicalIF":1.4,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143981581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Disambiguating a Soft Metagenomic Clustering.","authors":"Rahul Nihalani, Jaroslaw Zola, Srinivas Aluru","doi":"10.1089/cmb.2024.0825","DOIUrl":"10.1089/cmb.2024.0825","url":null,"abstract":"<p><p>Clustering is a popular technique used for analyzing amplicon sequencing data in metagenomics. Specifically, it is used to assign sequences (<i>reads</i>) to clusters, each cluster representing a species or a higher level taxonomic unit. Reads from multiple species often sharing subsequences, combined with lack of a perfect similarity measure, make it difficult to correctly assign reads to clusters. Thus, metagenomic clustering methods must either resort to ambiguity, or make the best available choice at each read assignment stage, which could lead to incorrect clusters and potentially cascading errors. In this article, we argue for first generating an ambiguous clustering and then resolving the ambiguities collectively by analyzing the ambiguous clusters. We propose a rigorous formulation of this problem and show that it is <i>NP</i>-Hard. We then propose an efficient heuristic to solve it in practice. We validate our approach on several synthetically generated datasets and two datasets consisting of 16S rDNA sequences from the microbiome of rat guts.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"473-485"},"PeriodicalIF":1.4,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143573107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Faster and More Accurate Estimation of Protein Hinges Based on Information Criteria.","authors":"Bunsho Koyano, Tetsuo Shibuya","doi":"10.1089/cmb.2024.0731","DOIUrl":"https://doi.org/10.1089/cmb.2024.0731","url":null,"abstract":"<p><p>Protein hinges are flexible parts connecting several rigid substructures of proteins that are crucial to determine protein function. Various methods have been developed for efficiently and accurately estimating protein hinge positions by comparing two different conformations of the same protein for a growing number of protein structures. However, few studies have focused on accurately estimating the number of hinges, and it is required to accurately estimate both the number and positions of hinges. We propose faster and more accurate algorithms for estimating the number and positions of hinges by utilizing information criteria that run in <i>O</i>(<i>n</i><sup>2</sup>)-time, where <i>n</i> is the protein length. Our algorithms utilize Bayesian Information Criterion (BIC) or Akaike's Information Criterion based on a newly proposed <i>k</i>-hinge structure generation model that models the hinge motions between two protein conformations. Our exact algorithm based on BIC outperformed the most accurate previous method in terms of both hinge number and position accuracy on our simulation dataset. Our exact algorithm was approximately as fast as the previous fastest method, DynDom, on our simulation dataset. We evaluated the hinge number and position accuracy of our exact algorithm and previous methods on one hinge-annotated dataset. The hinge number and position accuracy of our exact algorithm were comparable to the most accurate previous method on the hinge-annotated dataset. We further propose even faster <i>O</i>(<i>n</i>)-time heuristic algorithms, where <i>n</i> is the protein length. 
Our heuristic algorithm achieved almost the same hinge number and position accuracy as our exact algorithm, and was over 18 times faster than our exact algorithm and DynDom.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"32 5","pages":"498-519"},"PeriodicalIF":1.4,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144004036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Structure Feature Introduced to Predict Protein-Protein Interaction Sites.","authors":"Lingwei Lai, Jing Geng, Haochen Duan, Siyuan Chen, Lvwen Huang, Jiantao Yu","doi":"10.1089/cmb.2024.0804","DOIUrl":"10.1089/cmb.2024.0804","url":null,"abstract":"<p><p>Interaction between proteins often depends on the sequence features and structure features of proteins. Both of these features are helpful for machine learning methods to predict protein-protein interaction (PPI) sites. In this study, we introduced a new structure feature: the concave-convex feature on the protein surface, which was computed from the structural data of proteins in the Protein Data Bank database. A prediction model combining protein sequence features and structure features, named SSPPI_Ensemble (Sequence and Structure geometric feature-based PPI site prediction), was then constructed. Three sequence features, i.e., PSSMs (Position-Specific Scoring Matrices), HMM (Hidden Markov Models) and raw protein sequence, were used. The Dictionary of Secondary Structure in Proteins and the concave-convex feature were used as the structure features. Compared with the other prediction methods, our method achieved better performance or showed clear advantages on the same test datasets, confirming that the proposed concave-convex feature is useful in predicting PPI sites.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"520-536"},"PeriodicalIF":1.4,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143501476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VTrans: A VAE-Based Pre-Trained Transformer Method for Microbiome Data Analysis.","authors":"Xinyuan Shi, Fangfang Zhu, Wenwen Min","doi":"10.1089/cmb.2024.0884","DOIUrl":"https://doi.org/10.1089/cmb.2024.0884","url":null,"abstract":"<p><p>Predicting the survival outcomes and assessing the risk of patients play a pivotal role in comprehending the microbial composition across various stages of cancer. With the ongoing advancements in deep learning, it has been substantiated that deep learning holds the potential to analyze patient survival risks based on microbial data. However, a common challenge in individual cancer datasets is the limited sample size and the high dimensionality of the feature space. This predicament often leads to overfitting issues in deep learning models, hindering their ability to effectively extract profound data representations and resulting in suboptimal model performance. To overcome these challenges, we advocate the utilization of pre-training and fine-tuning strategies, which have proven effective in addressing the constraint of having a smaller sample size in individual cancer datasets. In this study, we propose VTrans, a deep learning model that combines a Transformer encoder with a variational autoencoder (VAE) and employs both pre-training and fine-tuning strategies to predict the survival risk of cancer patients using microbial data. Furthermore, we highlight the potential of extending VTrans to integrate microbial multi-omics data. Our method is assessed on three distinct cancer datasets from The Cancer Genome Atlas Program, and the research findings demonstrated that (1) VTrans excels in terms of performance compared to conventional machine learning and other deep learning models. (2) The utilization of pre-training significantly enhances its performance. (3) In contrast to positional encoding, employing VAE encoding proves to be more effective in enriching data representation. 
(4) Using saliency maps, it is possible to observe which microbes contribute most to the classification results. These results demonstrate the effectiveness of VTrans in predicting patient survival risk. Source code and all datasets used in this paper are available at https://github.com/wenwenmin/VTrans and https://doi.org/10.5281/zenodo.14166580.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144002934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FedOpenHAR: Federated Multitask Transfer Learning for Sensor-Based Human Activity Recognition.","authors":"Egemen İşgüder, Özlem Durmaz İncel","doi":"10.1089/cmb.2024.0631","DOIUrl":"https://doi.org/10.1089/cmb.2024.0631","url":null,"abstract":"<p><p>Wearable and mobile devices equipped with motion sensors offer important insights into user behavior. Machine learning and, more recently, deep learning techniques have been applied to analyze sensor data. Typically, the focus is on a single task, such as human activity recognition (HAR), and the data is processed centrally on a server or in the cloud. However, the same sensor data can be leveraged for multiple tasks, and distributed machine learning methods can be employed without the need for transmitting data to a central location. In this study, we introduce the FedOpenHAR framework, which explores federated transfer learning in a multitask setting for both sensor-based HAR and device position identification tasks. This approach utilizes transfer learning by training task-specific and personalized layers in a federated manner. The OpenHAR framework, which includes ten smaller datasets, is used for training the models. The main challenge is developing robust models that are applicable to both tasks across different datasets, which may contain only a subset of label types. Multiple experiments are conducted in the Flower federated learning environment using the DeepConvLSTM architecture. Results are presented for both federated and centralized training under various parameters and constraints. By employing transfer learning and training task-specific and personalized federated models, we achieve a higher accuracy (72.4%) compared to a fully centralized training approach (64.5%), and similar accuracy to a scenario where each client performs individual training in isolation (72.6%). 
However, the advantage of FedOpenHAR over individual training is that, when a new client joins with a new label type (representing a new task), it can begin training from the already existing common layer. Furthermore, if a new client wants to classify a new class in one of the existing tasks, FedOpenHAR allows training to begin directly from the task-specific layers.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143972829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}