{"title":"A Bayesian Change Point Model for Dynamic Alternative Transcription Start Site Usage During Cellular Differentiation.","authors":"Juan Xia, Yuxia Li, Haotian Zhu, Feiyang Xue, Feng Shi, Nana Li","doi":"10.1089/cmb.2023.0174","DOIUrl":"10.1089/cmb.2023.0174","url":null,"abstract":"<p><b>An alternative transcription start site (ATSS) is a major driving force for increasing the complexity of transcripts in human tissues. As a transcriptional regulatory mechanism, ATSS has biological significance. Many studies have confirmed that ATSS plays an important role in diseases and cell development and differentiation. However, exploration of its dynamic mechanisms remains insufficient. Identifying ATSS change points during cell differentiation is critical for elucidating potential dynamic mechanisms. For relative ATSS usage as percentage data, the existing methods lack sensitivity to detect the change point for ATSS longitudinal data. In addition, some methods have strict requirements for data distribution and cannot be applied to deal with this problem. In this study, the Bayesian change point detection model was first constructed using reparameterization techniques for two parameters of a beta distribution for the percentage data type, and the posterior distributions of parameters and change points were obtained using Markov Chain Monte Carlo (MCMC) sampling. With comprehensive simulation studies, the performance of the Bayesian change point detection model is found to be consistently powerful and robust across most scenarios with different sample sizes and beta distributions. Second, differential ATSS events in the real data, whose change points were identified using our method, were clustered according to their change points. Last, for each change point, pathway and transcription factor motif analyses were performed on its differential ATSS events. The results of our analyses demonstrated the effectiveness of the Bayesian change point detection model and provided biological insights into cell differentiation.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"445-457"},"PeriodicalIF":1.7,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140943652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enforcing Temporal Consistency in Migration History Inference.","authors":"Mrinmoy Saha Roddur, Sagi Snir, Mohammed El-Kebir","doi":"10.1089/cmb.2023.0352","DOIUrl":"10.1089/cmb.2023.0352","url":null,"abstract":"<p><b>In addition to undergoing evolution, members of biological populations may also migrate between locations. Examples include the spread of tumor cells from the primary tumor to distant metastases or the spread of pathogens from one host to another. One may represent migration histories by assigning a location label to each vertex of a given phylogenetic tree such that an edge connecting vertices with distinct locations represents a migration. Some biological populations undergo comigration, a phenomenon where multiple taxa from distinct lineages simultaneously comigrate from one location to another. In this work, we show that a previous problem statement for inferring migration histories that are parsimonious in terms of migrations and comigrations may lead to temporally inconsistent solutions. To remedy this deficiency, we introduce precise definitions of temporal consistency of comigrations in a phylogenetic tree, leading to three successive problems. First, we formulate the temporally consistent comigration problem to check if a set of comigrations is temporally consistent and provide a linear time algorithm for solving this problem. Second, we formulate the parsimonious consistent comigrations (PCC) problem, which aims to find comigrations given a location labeling of a phylogenetic tree. We show that PCC is NP-hard. Third, we formulate the parsimonious consistent comigration history (PCCH) problem, which infers the migration history given a phylogenetic tree and locations of its extant vertices only. We show that PCCH is NP-hard as well. On the positive side, we propose integer linear programming models to solve the PCC and PCCH problems. We demonstrate our algorithms on simulated and real data.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"396-415"},"PeriodicalIF":1.4,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140957697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Integer Linear Programming Model to Optimize Coding DNA Sequences By Joint Control of Transcript Indicators.","authors":"Claudio Arbib, Andrea D'ascenzo, Fabrizio Rossi, Daniele Santoni","doi":"10.1089/cmb.2023.0166","DOIUrl":"10.1089/cmb.2023.0166","url":null,"abstract":"<p><b>A <i>Coding DNA Sequence</i> (CDS) is a fraction of DNA whose nucleotides are grouped into consecutive triplets called codons, each one encoding an amino acid. Because most amino acids can be encoded by more than one codon, the same amino acid chain can be obtained by a very large number of different CDSs. These synonymous CDSs show different features that, also depending on the organism the transcript is expressed in, could affect translational efficiency and yield. The identification of optimal CDSs with respect to given transcript indicators is in general a challenging task, but it has been observed in recent literature that integer linear programming (ILP) can be a very flexible and efficient way to achieve it. In this article, we add evidence to this observation by proposing a new ILP model that simultaneously optimizes different well-grounded indicators. With this model, we efficiently find solutions that dominate those returned by six existing codon optimization heuristics.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"416-428"},"PeriodicalIF":1.7,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140851945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Minimizers and Convolutional Filters: Theoretical Connections and Applications to Genome Analysis.","authors":"Yun William Yu","doi":"10.1089/cmb.2024.0483","DOIUrl":"10.1089/cmb.2024.0483","url":null,"abstract":"<p><b>Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. In this study, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step toward building a deep learning assembler, although it is at present too slow to be practical. In total, this article provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"381-395"},"PeriodicalIF":1.7,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140870311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Singular Value Decomposition-Based Penalized Multinomial Regression for Classifying Imbalanced Medulloblastoma Subgroups Using Methylation Data.","authors":"Isra Mohammed, Murtada K Elbashir, Areeg S Faggad","doi":"10.1089/cmb.2023.0198","DOIUrl":"10.1089/cmb.2023.0198","url":null,"abstract":"<p><b>Medulloblastoma (MB) is a molecularly heterogeneous brain malignancy with large differences in clinical presentation. According to genomic studies, there are at least four distinct molecular subgroups of MB: sonic hedgehog (SHH), wingless/INT (WNT), Group 3, and Group 4. The treatment and outcomes depend on appropriate classification. It is difficult for the classification algorithms to identify these subgroups from an imbalanced MB genomic data set, where the distribution of samples among the MB subgroups may not be equal. To overcome this problem, we used singular value decomposition (SVD) and group lasso techniques to find DNA methylation probe features that maximize the separation between the different imbalanced MB subgroups. We used multinomial regression as a classification method to classify the four different molecular subgroups of MB using the reduced DNA methylation data. Coordinate descent is used to solve our loss function associated with the group lasso, which promotes sparsity. By using SVD, we were able to reduce the 321,174 probe features to just 200 features. Less than 40 features were successfully selected after applying the group lasso, which we then used as predictors for our classification models. Our proposed method achieved an average overall accuracy of 99% based on fivefold cross-validation technique. Our approach produces improved classification performance compared with the state-of-the-art methods for classifying MB molecular subgroups.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"458-471"},"PeriodicalIF":1.7,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140943774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"More Is Faster: Why Population Size Matters in Biological Search.","authors":"Jannatul Ferdous, George Matthew Fricke, Melanie E Moses","doi":"10.1089/cmb.2023.0296","DOIUrl":"10.1089/cmb.2023.0296","url":null,"abstract":"<p><b>Many biological scenarios have multiple cooperating searchers, and the timing of the initial first contact between any one of those searchers and its target is critically important. However, we are unaware of biological models that predict how long it takes for the first of many searchers to discover a target. We present a novel mathematical model that predicts initial first contact times between searchers and targets distributed at random in a volume. We compare this model with the extreme first passage time approach in physics that assumes an infinite number of searchers all initially positioned at the same location. We explore how the number of searchers, the distribution of searchers and targets, and the initial distances between searchers and targets affect initial first contact times. Given a constant density of uniformly distributed searchers and targets, the initial first contact time decreases linearly with both search volume and the number of searchers. However, given only a single target and searchers placed at the same starting location, the relationship between the initial first contact time and the number of searchers shifts from a linear decrease to a logarithmic decrease as the number of searchers grows very large. More generally, we show that initial first contact times can be dramatically faster than the average first contact times and that the initial first contact times decrease with the number of searchers, while the average search times are independent of the number of searchers. We suggest that this is an underappreciated phenomenon in biology and other collective search problems.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"429-444"},"PeriodicalIF":1.7,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140957699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Floor Is Lava: Halving Natural Genomes with Viaducts, Piers, and Pontoons.","authors":"Leonard Bohnenkämper","doi":"10.1089/cmb.2023.0330","DOIUrl":"10.1089/cmb.2023.0330","url":null,"abstract":"<p><b>Whole Genome Duplications (WGDs) are events that double the content and structure of a genome. In some organisms, multiple WGD events have been observed while loss of genetic material is a typical occurrence following a WGD event. The requirement of classic rearrangement models that every genetic marker has to occur exactly two times in a given problem instance, therefore, poses a serious restriction in this context. The Double-Cut and Join (DCJ) model is a simple and powerful model for the analysis of large structural rearrangements. After being extended to the DCJ-Indel model, capable of handling gains and losses of genetic material, research has shifted in recent years toward enabling it to handle natural genomes, for which no assumption about the distribution of markers has to be made. The traditional theoretical framework for studying WGD events is the Genome Halving Problem (GHP). While the GHP is solved for the DCJ model for genomes without losses, there are currently no exact algorithms utilizing the DCJ-Indel model that are able to handle natural genomes. In this work, we present a general view on the DCJ-Indel model that we apply to derive an exact polynomial time and space solution for the GHP on genomes with at most two genes per family before generalizing the problem to an integer linear program solution for natural genomes.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"31 4","pages":"294-311"},"PeriodicalIF":1.7,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11057688/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140848856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computing the Bounds of the Number of Reticulations in a Tree-Child Network That Displays a Set of Trees.","authors":"Yufeng Wu, Louxin Zhang","doi":"10.1089/cmb.2023.0309","DOIUrl":"10.1089/cmb.2023.0309","url":null,"abstract":"<p><b>Phylogenetic network is an evolutionary model that uses a rooted directed acyclic graph (instead of a tree) to model an evolutionary history of species in which reticulate events (e.g., hybrid speciation or horizontal gene transfer) occurred. Tree-child network is a kind of phylogenetic network with structural constraints. Existing approaches for tree-child network reconstruction can be slow for large data. In this study, we present several computational approaches for bounding from below the number of reticulations in a tree-child network that displays a given set of rooted binary phylogenetic trees. In addition, we also present some theoretical results on bounding from above the number of reticulations. Through simulation, we demonstrate that the new lower bounds on the reticulation number for tree-child networks can practically be computed for large tree data. The bounds can provide estimates of reticulation for relatively large data.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"345-359"},"PeriodicalIF":1.7,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139576061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Asymmetric Cluster-Based Measures for Comparative Phylogenetics.","authors":"Sanket Wagle, Alexey Markin, Paweł Górecki, Tavis K Anderson, Oliver Eulenstein","doi":"10.1089/cmb.2023.0338","DOIUrl":"10.1089/cmb.2023.0338","url":null,"abstract":"<p><b>Phylogenetic inference and reconstruction methods generate hypotheses on evolutionary history. Competing inference methods are frequently used, and the evaluation of the generated hypotheses is achieved using tree comparison costs. The Robinson-Foulds (RF) distance is a widely used cost to compare the topology of two trees, but this cost is sensitive to tree error and can overestimate tree differences. To overcome this limitation, a refined version of the RF distance called the Cluster Affinity (CA) distance was introduced. However, CA distances are symmetric and cannot compare different types of trees. These asymmetric comparisons occur when gene trees are compared with species trees, when disparate datasets are integrated into a supertree, or when tree comparison measures are used to infer a phylogenetic network. In this study, we introduce a relaxation of the original Affinity distance to compare heterogeneous trees called the asymmetric CA cost. We also develop a biologically interpretable cost, the Cluster Support cost that normalizes by cluster size across gene trees. The characteristics of these costs are similar to the symmetric CA cost. We describe efficient algorithms, derive the exact diameters, and use these to standardize the cost to be applicable in practice. These costs provide objective, fine-scale, and biologically interpretable values that can assess differences and similarities between phylogenetic trees.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"31 4","pages":"312-327"},"PeriodicalIF":1.7,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11057527/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140863219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Biology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}