{"title":"On Minimizers and Convolutional Filters: Theoretical Connections and Applications to Genome Analysis.","authors":"Yun William Yu","doi":"10.1089/cmb.2024.0483","DOIUrl":"10.1089/cmb.2024.0483","url":null,"abstract":"<p><p><b>Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. In this study, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step toward building a deep learning assembler, although it is at present too slow to be practical. 
In total, this article provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"381-395"},"PeriodicalIF":1.7,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140870311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Singular Value Decomposition-Based Penalized Multinomial Regression for Classifying Imbalanced Medulloblastoma Subgroups Using Methylation Data.","authors":"Isra Mohammed, Murtada K Elbashir, Areeg S Faggad","doi":"10.1089/cmb.2023.0198","DOIUrl":"10.1089/cmb.2023.0198","url":null,"abstract":"<p><p><b>Medulloblastoma (MB) is a molecularly heterogeneous brain malignancy with large differences in clinical presentation. According to genomic studies, there are at least four distinct molecular subgroups of MB: sonic hedgehog (SHH), wingless/INT (WNT), Group 3, and Group 4. The treatment and outcomes depend on appropriate classification. It is difficult for the classification algorithms to identify these subgroups from an imbalanced MB genomic data set, where the distribution of samples among the MB subgroups may not be equal. To overcome this problem, we used singular value decomposition (SVD) and group lasso techniques to find DNA methylation probe features that maximize the separation between the different imbalanced MB subgroups. We used multinomial regression as a classification method to classify the four different molecular subgroups of MB using the reduced DNA methylation data. Coordinate descent is used to solve our loss function associated with the group lasso, which promotes sparsity. By using SVD, we were able to reduce the 321,174 probe features to just 200 features. Less than 40 features were successfully selected after applying the group lasso, which we then used as predictors for our classification models. Our proposed method achieved an average overall accuracy of 99% based on fivefold cross-validation technique. 
Our approach produces improved classification performance compared with the state-of-the-art methods for classifying MB molecular subgroups</b>.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"458-471"},"PeriodicalIF":1.7,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140943774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"More Is Faster: Why Population Size Matters in Biological Search.","authors":"Jannatul Ferdous, George Matthew Fricke, Melanie E Moses","doi":"10.1089/cmb.2023.0296","DOIUrl":"10.1089/cmb.2023.0296","url":null,"abstract":"<p><p><b>Many biological scenarios have multiple cooperating searchers, and the timing of the initial first contact between any one of those searchers and its target is critically important. However, we are unaware of biological models that predict how long it takes for the first of many searchers to discover a target. We present a novel mathematical model that predicts initial first contact times between searchers and targets distributed at random in a volume. We compare this model with the extreme first passage time approach in physics that assumes an infinite number of searchers all initially positioned at the same location. We explore how the number of searchers, the distribution of searchers and targets, and the initial distances between searchers and targets affect initial first contact times. Given a constant density of uniformly distributed searchers and targets, the initial first contact time decreases linearly with both search volume and the number of searchers. However, given only a single target and searchers placed at the same starting location, the relationship between the initial first contact time and the number of searchers shifts from a linear decrease to a logarithmic decrease as the number of searchers grows very large. More generally, we show that initial first contact times can be dramatically faster than the average first contact times and that the initial first contact times decrease with the number of searchers, while the average search times are independent of the number of searchers. 
We suggest that this is an underappreciated phenomenon in biology and other collective search problems.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"429-444"},"PeriodicalIF":1.7,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140957699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Floor Is Lava: Halving Natural Genomes with Viaducts, Piers, and Pontoons.","authors":"Leonard Bohnenkämper","doi":"10.1089/cmb.2023.0330","DOIUrl":"10.1089/cmb.2023.0330","url":null,"abstract":"<p><p><b>Whole Genome Duplications (WGDs) are events that double the content and structure of a genome. In some organisms, multiple WGD events have been observed while loss of genetic material is a typical occurrence following a WGD event. The requirement of classic rearrangement models that every genetic marker has to occur exactly two times in a given problem instance, therefore, poses a serious restriction in this context. The Double-Cut and Join (DCJ) model is a simple and powerful model for the analysis of large structural rearrangements. After being extended to the DCJ-Indel model, capable of handling gains and losses of genetic material, research has shifted in recent years toward enabling it to handle natural genomes, for which no assumption about the distribution of markers has to be made. The traditional theoretical framework for studying WGD events is the Genome Halving Problem (GHP). While the GHP is solved for the DCJ model for genomes without losses, there are currently no exact algorithms utilizing the DCJ-Indel model that are able to handle natural genomes. 
In this work, we present a general view on the DCJ-Indel model that we apply to derive an exact polynomial time and space solution for the GHP on genomes with at most two genes per family before generalizing the problem to an integer linear program solution for natural genomes.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"31 4","pages":"294-311"},"PeriodicalIF":1.4,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11057688/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140848856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computing the Bounds of the Number of Reticulations in a Tree-Child Network That Displays a Set of Trees.","authors":"Yufeng Wu, Louxin Zhang","doi":"10.1089/cmb.2023.0309","DOIUrl":"10.1089/cmb.2023.0309","url":null,"abstract":"<p><p><b>Phylogenetic network is an evolutionary model that uses a rooted directed acyclic graph (instead of a tree) to model an evolutionary history of species in which reticulate events (e.g., hybrid speciation or horizontal gene transfer) occurred. Tree-child network is a kind of phylogenetic network with structural constraints. Existing approaches for tree-child network reconstruction can be slow for large data. In this study, we present several computational approaches for bounding from below the number of reticulations in a tree-child network that displays a given set of rooted binary phylogenetic trees. In addition, we also present some theoretical results on bounding from above the number of reticulations. Through simulation, we demonstrate that the new lower bounds on the reticulation number for tree-child networks can practically be computed for large tree data. The bounds can provide estimates of reticulation for relatively large data.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"345-359"},"PeriodicalIF":1.7,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139576061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Asymmetric Cluster-Based Measures for Comparative Phylogenetics.","authors":"Sanket Wagle, Alexey Markin, Paweł Górecki, Tavis K Anderson, Oliver Eulenstein","doi":"10.1089/cmb.2023.0338","DOIUrl":"10.1089/cmb.2023.0338","url":null,"abstract":"<p><p><b>Phylogenetic inference and reconstruction methods generate hypotheses on evolutionary history. Competing inference methods are frequently used, and the evaluation of the generated hypotheses is achieved using tree comparison costs. The Robinson-Foulds (RF) distance is a widely used cost to compare the topology of two trees, but this cost is sensitive to tree error and can overestimate tree differences. To overcome this limitation, a refined version of the RF distance called the Cluster Affinity (CA) distance was introduced. However, CA distances are symmetric and cannot compare different types of trees. These asymmetric comparisons occur when gene trees are compared with species trees, when disparate datasets are integrated into a supertree, or when tree comparison measures are used to infer a phylogenetic network. In this study, we introduce a relaxation of the original Affinity distance to compare heterogeneous trees called the asymmetric CA cost. We also develop a biologically interpretable cost, the Cluster Support cost that normalizes by cluster size across gene trees. The characteristics of these costs are similar to the symmetric CA cost. We describe efficient algorithms, derive the exact diameters, and use these to standardize the cost to be applicable in practice. 
These costs provide objective, fine-scale, and biologically interpretable values that can assess differences and similarities between phylogenetic trees.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"31 4","pages":"312-327"},"PeriodicalIF":1.7,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11057527/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140863219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fixed-Parameter Tractable Algorithm for Finding Agreement Cherry-Reduced Subnetworks in Level-1 Orchard Networks.","authors":"Kaari Landry, Olivier Tremblay-Savard, Manuel Lafond","doi":"10.1089/cmb.2023.0317","DOIUrl":"10.1089/cmb.2023.0317","url":null,"abstract":"<p><p><b>Phylogenetic networks are increasingly being considered better suited to represent the complexity of the evolutionary relationships between species. One class of phylogenetic networks that have received a lot of attention recently is the class of orchard networks, which is composed of networks that can be reduced to a single leaf using cherry reductions. Cherry reductions, also called cherry-picking operations, remove either a leaf of a simple cherry (sibling leaves sharing a parent) or a reticulate edge of a reticulate cherry (two leaves whose parents are connected by a reticulate edge). In this article, we present a fixed-parameter tractable algorithm to solve the problem of finding a maximum agreement cherry-reduced subnetwork (MACRS) between two rooted binary level-1 networks. This is the first exact algorithm proposed to solve the MACRS problem. As proven in an earlier work, there is a direct relationship between finding an MACRS and calculating a distance based on cherry operations. 
As a result, the proposed algorithm also provides a distance that can be used for the comparison of level-1 networks</b>.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"360-379"},"PeriodicalIF":1.7,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"138830002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The <i>k</i>-Robinson-Foulds Dissimilarity Measures for Comparison of Labeled Trees.","authors":"Elahe Khayatian, Gabriel Valiente, Louxin Zhang","doi":"10.1089/cmb.2023.0312","DOIUrl":"10.1089/cmb.2023.0312","url":null,"abstract":"<p><p><b>Understanding the mutational history of tumor cells is a critical endeavor in unraveling the mechanisms that drive the onset and progression of cancer. Modeling tumor cell evolution with labeled trees motivates researchers to develop different measures to compare labeled trees. Although the Robinson-Foulds (RF) distance is widely used for comparing species trees, its applicability to labeled trees reveals certain limitations. This study introduces the <i>k</i>-RF dissimilarity measures, tailored to address the challenges of labeled tree comparison. The RF distance is succinctly expressed as <i>n</i>-RF in the space of labeled trees with <i>n</i> nodes. Like the RF distance, the <i>k</i>-RF is a pseudometric for multiset-labeled trees and becomes a metric in the space of 1-labeled trees. By setting <i>k</i> to a small value, the <i>k</i>-RF dissimilarity can capture analogous local regions in two labeled trees with different size or different labels.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"328-344"},"PeriodicalIF":1.7,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11057537/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139564180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Orthology and Paralogy Relationships at Transcript Level.","authors":"Wend Yam D D Ouedraogo, Aida Ouangraoua","doi":"10.1089/cmb.2023.0400","DOIUrl":"10.1089/cmb.2023.0400","url":null,"abstract":"<p><p><b>Eukaryotic genes undergo a mechanism called alternative processing, resulting in transcriptome diversity by allowing the production of multiple distinct transcripts from a gene. More than half of human genes are affected, and the resulting transcripts are highly conserved among orthologous genes of distinct species. In this work, we present the definition of orthology and paralogy between transcripts of homologous genes, together with an algorithm to compute clusters of conserved orthologous and paralogous transcripts. Gene-level homology relationships are utilized to define various types of homology relationships between transcripts originating from the same ancestral transcript. A Reciprocal Best Hits approach is employed to infer clusters of isoorthologous and recent paralogous transcripts. We applied this method to transcripts from simulated gene families as well as real gene families from the Ensembl-Compara database. The results are consistent with those from previous studies that compared orthologous gene transcripts. 
Furthermore, our findings provide evidence that searching for conserved transcripts between homologous genes, beyond the scope of orthologous genes, is likely to yield valuable information.</b></p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"31 4","pages":"277-293"},"PeriodicalIF":1.7,"publicationDate":"2024-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140861411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}