{"title":"MMG4: Recognition of G4-Forming Sequences Based on Markov Model.","authors":"Boyuan Yu, Hao Zhang, Cong Pian, Yuanyuan Chen","doi":"10.1089/cmb.2024.0523","DOIUrl":"https://doi.org/10.1089/cmb.2024.0523","url":null,"abstract":"<p><p>G-quadruplexes (G4s) are special nucleic acid structures with various important biological functions. Existing tools and technologies for G4-forming sequences recognition are limited to time-consuming and costly methods such as circular dichroism and nuclear magnetic resonance. Developing a fast and accurate model for G4-forming sequences recognition has far-reaching significance. In this study, MMG4, a novel model to recognize G4-forming sequences based on Markov model (MM), was developed and the phenomenon of high recognition accuracy in the central region of the sequence and low accuracy in the two end regions was discovered. It was further found that the differences in base transfer probabilities, ratio distribution, and G4-motif structural content in different regions may be the causes of this phenomenon. The study also explored the impact of sequence length on recognition accuracy and found the optimal recognition interval to be [910-1049], with the highest recognition accuracy reaching 85.95%. By extracting sequence features, the study constructed three types of machine learning models: random forest (RF), support vector machine, and back-propagation neural network. It was found that recognition performance of MM was significantly better than that of the other three machine learning models, proving that the recognition method based on MM can effectively capture the correlation information between adjacent nucleotides of G4. By combining MM with the three machine learning models, the predictive performance of MMG4 improved. Among them, the RF model combined with MM has the best performance, achieving an area under the receiver operating characteristic curve value of 0.93 and an area under the precision-recall curve value of 0.9. Finally, the study validated the model robustness and generalization ability through independent testing dataset.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142466641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lin Zhang, Jigen Peng, Yongbin Ge, Haiyang Li, Yuchao Tang
{"title":"High-Accuracy Positivity-Preserving Finite Difference Approximations of the Chemotaxis Model for Tumor Invasion.","authors":"Lin Zhang, Jigen Peng, Yongbin Ge, Haiyang Li, Yuchao Tang","doi":"10.1089/cmb.2023.0316","DOIUrl":"10.1089/cmb.2023.0316","url":null,"abstract":"<p><p>Numerical simulation of the complex evolution process for tumor invasion plays an extremely important role in-depth exploring the bio-taxis phenomena of tumor growth and metastasis. In view of the fact that low-accuracy numerical methods often have large errors and low resolution, very refined grids have to be used if we want to get high-resolution simulating results, which leads to a great deal of computational cost. In this paper, we are committed to developing a class of high-accuracy positivity-preserving finite difference methods to solve the chemotaxis model for tumor invasion. First, two unconditionally stable implicit compact difference schemes for solving the model are proposed; second, the local truncation errors of the new schemes are analyzed, which show that they have second-order accuracy in time and fourth-order accuracy in space; third, based on the proposed schemes, the high-accuracy numerical integration idea of binary functions is employed to structure a linear compact weighting formula that guarantees fourth-order accuracy and nonnegative, and then a positivity-preserving and time-marching algorithm is established; and finally, the accuracy, stability, and positivity-preserving of the proposed methods are verified by several numerical experiments, and the evolution phenomena of tumor invasion over time are numerically simulated and analyzed.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142380968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rudolf Schill, Maren Klever, Andreas Lösch, Y Linda Hu, Stefan Vocht, Kevin Rupp, Lars Grasedyck, Rainer Spang, Niko Beerenwinkel
{"title":"Correcting for Observation Bias in Cancer Progression Modeling.","authors":"Rudolf Schill, Maren Klever, Andreas Lösch, Y Linda Hu, Stefan Vocht, Kevin Rupp, Lars Grasedyck, Rainer Spang, Niko Beerenwinkel","doi":"10.1089/cmb.2024.0666","DOIUrl":"10.1089/cmb.2024.0666","url":null,"abstract":"<p><p>Tumor progression is driven by the accumulation of genetic alterations, including both point mutations and copy number changes. Understanding the temporal sequence of these events is crucial for comprehending the disease but is not directly discernible from cross-sectional genomic data. Cancer progression models, including Mutual Hazard Networks (MHNs), aim to reconstruct the dynamics of tumor progression by learning the causal interactions between genetic events based on their co-occurrence patterns in cross-sectional data. Here, we highlight a commonly overlooked bias in cross-sectional datasets that can distort progression modeling. Tumors become clinically detectable when they cause symptoms or are identified through imaging or tests. Detection factors, such as size, inflammation (fever, fatigue), and elevated biochemical markers, are influenced by genomic alterations. Ignoring these effects leads to \"conditioning on a collider\" bias, where events making the tumor more observable appear anticorrelated, creating false suppressive effects or masking promoting effects among genetic events. We enhance MHNs by incorporating the effects of genetic progression events on the inclusion of a tumor in a dataset, thus correcting for collider bias. We derive an efficient tensor formula for the likelihood function and apply it to two datasets from the MSK-IMPACT study. In colon adenocarcinoma, we observe a significantly higher rate of clinical detection for TP53-positive tumors, while in lung adenocarcinoma, the same is true for EGFR-positive tumors. Compared to classical MHNs, this approach eliminates several spurious suppressive interactions and uncovers multiple promoting effects.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"31 10","pages":"927-945"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142545770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximate IsoRank for Scalable and Functionally Meaningful Cross-Species Alignments of Protein Interaction Networks.","authors":"Kapil Devkota, Anselm Blumer, Xiaozhe Hu, Lenore Cowen","doi":"10.1089/cmb.2024.0673","DOIUrl":"10.1089/cmb.2024.0673","url":null,"abstract":"<p><p>The IsoRank algorithm of Singh, Xu, and Berger was a pioneering algorithmic advance that applied spectral methods to the problem of cross-species global alignment of biological networks. We develop a new IsoRank approximation that exploits the mathematical properties of IsoRank's linear system to solve the problem in quadratic time with respect to the maximum size of the two protein-protein interaction (PPI) networks. We further propose a refinement to this initial approximation so that the updated result is even closer to the original IsoRank formulation while remaining computationally inexpensive. In experiments on synthetic and real PPI networks with various proposed metrics to measure alignment quality, we find the results of our approximate IsoRank are nearly as accurate as the original IsoRank. In fact, for functional enrichment-based measures of global network alignment quality, our approximation performs better than the exact IsoRank, which is doubtless because it is more robust to the noise of missing or incorrect edges. It also performs competitively against two more recent global network alignment algorithms. We also present an analogous approximation to IsoRankN, which extends the network alignment to more than two species.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"990-1007"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142347647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luca Renders, Lore Depuydt, Sven Rahmann, Jan Fostier
{"title":"Lossless Approximate Pattern Matching: Automated Design of Efficient Search Schemes.","authors":"Luca Renders, Lore Depuydt, Sven Rahmann, Jan Fostier","doi":"10.1089/cmb.2024.0664","DOIUrl":"10.1089/cmb.2024.0664","url":null,"abstract":"<p><p>This study introduces a pioneering approach to automate the creation of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches over a partitioned pattern. Each search specifies the processing order of these parts and the cumulative lower and upper bounds on the number of errors in each part of the pattern. Together, these searches ensure the identification of all approximate occurrences of a search pattern within a predefined limit of <i>k</i> errors. While existing literature offers designed schemes for up to <i>k</i> = 4 errors, designing search schemes for larger <i>k</i> values incurs escalating computational costs. Our method integrates a greedy algorithm and a novel Integer Linear Programming (ILP) formulation to design efficient search schemes for up to <i>k</i> = 7 errors. Comparative analyses demonstrate the superiority of our ILP-optimal schemes over alternative strategies in both theoretical and practical contexts. Additionally, we propose a dynamic scheme selection technique tailored to specific search patterns, further enhancing efficiency. Combined, this yields runtime reductions of up to 53% for higher <i>k</i> values. To facilitate search scheme generation, we present Hato, an open-source software tool (AGPL-3.0 license) employing the greedy algorithm and utilizing CPLEX for ILP solving. Furthermore, we introduce Columba 1.2, an open-source lossless read-mapper (AGPL-3.0 license) implemented in C++. Columba surpasses existing state-of-the-art tools by identifying all approximate occurrences of 100,000 Illumina reads (150 bp) in the human reference genome within 24 seconds (maximum edit distance of 4) and 75 seconds (maximum edit distance of 6) using a single CPU core. Notably, our study showcases Columba's capability to align 100,000 reads of length 50, with high error rates and up to an edit distance of 7, in a mere 2 hours and 15 minutes. This achievement is unmatched by other lossless aligners, which require over 3 hours for edit distance 5 alignments. Moreover, Columba exhibits a mapping rate four times higher than that of a lossy tool for this dataset.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"975-989"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142347648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Optimal Metabolic Factories.","authors":"Spencer Krieger, John Kececioglu","doi":"10.1089/cmb.2024.0748","DOIUrl":"10.1089/cmb.2024.0748","url":null,"abstract":"<p><p>Perhaps the most fundamental model in synthetic and systems biology for inferring pathways in metabolic reaction networks is a metabolic <i>factory</i>: a system of reactions that starts from a set of source compounds and produces a set of target molecules, while conserving or not depleting intermediate metabolites. Finding a shortest factory-that minimizes a sum of real-valued weights on its reactions to infer the most likely pathway-is NP-complete. The current state-of-the-art for shortest factories solves a mixed-integer linear program with a major drawback: it requires the user to set a critical parameter, where too large a value can make optimal solutions infeasible, while too small a value can yield degenerate solutions due to numerical error. We present the first <i>robust algorithm</i> for optimal factories that is both <i>parameter-free</i> (relieving the user from determining a parameter setting) and <i>degeneracy-free</i> (guaranteeing it finds an optimal nondegenerate solution). We also give for the first time a <i>complete characterization</i> of the graph-theoretic structure of shortest factories, that reveals an important class of degenerate solutions which was overlooked and potentially output by the prior state-of-the-art.We show degeneracy is precisely due to <i>invalid stoichiometries</i> in reactions, and provide an efficient algorithm for identifying all such <i>misannotations</i> in a metabolic network. In addition we settle the relationship between the two established pathway models of <i>hyperpaths</i> and factories by proving hyperpaths actually comprise a <i>subclass</i> of factories. Comprehensive experiments over all instances from the standard metabolic reaction databases in the literature demonstrate our parameter-free exact algorithm is <i>fast in practice</i>, quickly finding optimal factories in large real-world networks containing thousands of reactions. A preliminary implementation of our robust algorithm for shortest factories in a new tool called Freeia is available free for research use at http://freeia.cs.arizona.edu.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"1045-1086"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142347650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C S Elder, Minh Hoang, Mohsen Ferdosi, Carl Kingsford
{"title":"Approximate and Exact Optimization Algorithms for the Beltway and Turnpike Problems with Duplicated, Missing, Partially Labeled, and Uncertain Measurements.","authors":"C S Elder, Minh Hoang, Mohsen Ferdosi, Carl Kingsford","doi":"10.1089/cmb.2024.0661","DOIUrl":"10.1089/cmb.2024.0661","url":null,"abstract":"<p><p>The Turnpike problem aims to reconstruct a set of one-dimensional points from their unordered pairwise distances. Turnpike arises in biological applications such as molecular structure determination, genomic sequencing, tandem mass spectrometry, and molecular error-correcting codes. Under noisy observation of the distances, the Turnpike problem is NP-hard and can take exponential time and space to solve when using traditional algorithms. To address this, we reframe the noisy Turnpike problem through the lens of optimization, seeking to simultaneously find the unknown point set and a permutation that maximizes similarity to the input distances. Our core contribution is a suite of algorithms that robustly solve this new objective. This includes a bilevel optimization framework that can efficiently solve Turnpike instances with up to 100,000 points. We show that this framework can be extended to scenarios with domain-specific constraints that include duplicated, missing, and partially labeled distances. Using these, we also extend our algorithms to work for points distributed on a circle (the Beltway problem). For small-scale applications that require global optimality, we formulate an integer linear program (ILP) that (i) accepts an objective from a generic family of convex functions and (ii) uses an extended formulation to reduce the number of binary variables. On synthetic and real partial digest data, our bilevel algorithms achieved state-of-the-art scalability across challenging scenarios with performance that matches or exceeds competing baselines. On small-scale instances, our ILP efficiently recovered ground-truth assignments and produced reconstructions that match or exceed our alternating algorithms. Our implementations are available at https://github.com/Kingsford-Group/turnpikesolvermm.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"908-926"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142466625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Henry Childs, Nathan Guerin, Pei Zhou, Bruce R Donald
{"title":"Protocol for Designing <i>De Novo</i> Noncanonical Peptide Binders in OSPREY.","authors":"Henry Childs, Nathan Guerin, Pei Zhou, Bruce R Donald","doi":"10.1089/cmb.2024.0669","DOIUrl":"10.1089/cmb.2024.0669","url":null,"abstract":"<p><p>D-peptides, the mirror image of canonical L-peptides, offer numerous biological advantages that make them effective therapeutics. This article details how to use DexDesign, the newest OSPREY-based algorithm, for designing these D-peptides <i>de novo</i>. OSPREY physics-based models precisely mimic energy-equivariant reflection operations, enabling the generation of D-peptide scaffolds from L-peptide templates. Due to the scarcity of D-peptide:L-protein structural data, DexDesign calls a geometric hashing algorithm, Method of Accelerated Search for Tertiary Ensemble Representatives, as a subroutine to produce a synthetic structural dataset. DexDesign enables mixed-chirality designs with a new user interface and also reduces the conformation and sequence search space using three new design techniques: Minimum Flexible Set, Inverse Alanine Scanning, and K*-based Mutational Scanning.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"965-974"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142371980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alessio Campanelli, Giulio Ermanno Pibiri, Jason Fan, Rob Patro
{"title":"Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs<sup />.","authors":"Alessio Campanelli, Giulio Ermanno Pibiri, Jason Fan, Rob Patro","doi":"10.1089/cmb.2024.0714","DOIUrl":"10.1089/cmb.2024.0714","url":null,"abstract":"<p><p>We describe lossless compressed data structures for the <i>colored</i> de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from <i>k</i>-mers to their <i>color sets</i>. The color set of a <i>k</i>-mer is the set of all identifiers, or <i>colors</i>, of the references that contain the <i>k</i>-mer. While these maps find countless applications in computational biology (e.g., basic query, reading mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage on the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"1022-1044"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142390934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}