Letu Qingge, Kushal Badal, Richard Annan, Jordan Sturtz, Xiaowen Liu, Binhai Zhu
{"title":"Generative AI Models for the Protein Scaffold Filling Problem.","authors":"Letu Qingge, Kushal Badal, Richard Annan, Jordan Sturtz, Xiaowen Liu, Binhai Zhu","doi":"10.1089/cmb.2024.0510","DOIUrl":"https://doi.org/10.1089/cmb.2024.0510","url":null,"abstract":"<p><p>De novo protein sequencing is an important problem in proteomics, playing a crucial role in understanding protein functions, drug discovery, design and evolutionary studies, etc. Top-down and bottom-up tandem mass spectrometry are popular approaches used in the field of mass spectrometry to analyze and sequence proteins. However, these approaches often produce incomplete protein sequences with gaps, namely scaffolds. The protein scaffold filling problem refers to filling the missing amino acids in the gaps of a scaffold to infer the complete protein sequence. In this article, we tackle the protein scaffold filling problem based on generative AI techniques, such as convolutional denoising autoencoder, transformer, and generative pretrained transformer (GPT) models, to complete the protein sequences and compare our results with recently developed convolutional long short-term memory-based sequence model. We evaluate the model performance both on a real dataset and generated datasets. All proposed models show outstanding prediction accuracy. 
Notably, the GPT-2 model achieves 100% gap-filling accuracy and 100% full sequence accuracy on the MabCampth protein scaffold, which outperforms the other models.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142501311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Attention-Guided Residual U-Net with SE Connection and ASPP for Watershed-Based Cell Segmentation in Microscopy Images.","authors":"Jovial Niyogisubizo, Keliang Zhao, Jintao Meng, Yi Pan, Rosiyadi Didi, Yanjie Wei","doi":"10.1089/cmb.2023.0446","DOIUrl":"https://doi.org/10.1089/cmb.2023.0446","url":null,"abstract":"<p><p>Time-lapse microscopy imaging is a crucial technique in biomedical studies for observing cellular behavior over time, providing essential data on cell numbers, sizes, shapes, and interactions. Manual analysis of hundreds or thousands of cells is impractical, necessitating the development of automated cell segmentation approaches. Traditional image processing methods have made significant progress in this area, but the advent of deep learning methods, particularly those using U-Net-based networks, has further enhanced performance in medical and microscopy image segmentation. However, challenges remain, particularly in accurately segmenting touching cells in images with low signal-to-noise ratios. Existing methods often struggle with effectively integrating features across different levels of abstraction. This can lead to model confusion, particularly when important contextual information is lost or the features are not adequately distinguished. The challenge lies in appropriately combining these features to preserve critical details while ensuring robust and accurate segmentation. To address these issues, we propose a novel framework called RA-SE-ASPP-Net, which incorporates Residual Blocks, Attention Mechanism, Squeeze-and-Excitation connection, and Atrous Spatial Pyramid Pooling to achieve precise and robust cell segmentation. We evaluate our proposed architecture using an induced pluripotent stem cell reprogramming dataset, a challenging dataset that has received limited attention in this field. Additionally, we compare our model with different ablation experiments to demonstrate its robustness. 
The proposed architecture outperforms the baseline models in all evaluated metrics, providing the most accurate semantic segmentation results. Finally, we applied the watershed method to the semantic segmentation results to obtain precise segmentations with specific information for each cell.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142466639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rudolf Schill, Maren Klever, Andreas Lösch, Y Linda Hu, Stefan Vocht, Kevin Rupp, Lars Grasedyck, Rainer Spang, Niko Beerenwinkel
{"title":"Correcting for Observation Bias in Cancer Progression Modeling.","authors":"Rudolf Schill, Maren Klever, Andreas Lösch, Y Linda Hu, Stefan Vocht, Kevin Rupp, Lars Grasedyck, Rainer Spang, Niko Beerenwinkel","doi":"10.1089/cmb.2024.0666","DOIUrl":"10.1089/cmb.2024.0666","url":null,"abstract":"<p><p>Tumor progression is driven by the accumulation of genetic alterations, including both point mutations and copy number changes. Understanding the temporal sequence of these events is crucial for comprehending the disease but is not directly discernible from cross-sectional genomic data. Cancer progression models, including Mutual Hazard Networks (MHNs), aim to reconstruct the dynamics of tumor progression by learning the causal interactions between genetic events based on their co-occurrence patterns in cross-sectional data. Here, we highlight a commonly overlooked bias in cross-sectional datasets that can distort progression modeling. Tumors become clinically detectable when they cause symptoms or are identified through imaging or tests. Detection factors, such as size, inflammation (fever, fatigue), and elevated biochemical markers, are influenced by genomic alterations. Ignoring these effects leads to \"conditioning on a collider\" bias, where events making the tumor more observable appear anticorrelated, creating false suppressive effects or masking promoting effects among genetic events. We enhance MHNs by incorporating the effects of genetic progression events on the inclusion of a tumor in a dataset, thus correcting for collider bias. We derive an efficient tensor formula for the likelihood function and apply it to two datasets from the MSK-IMPACT study. In colon adenocarcinoma, we observe a significantly higher rate of clinical detection for TP53-positive tumors, while in lung adenocarcinoma, the same is true for EGFR-positive tumors. 
Compared to classical MHNs, this approach eliminates several spurious suppressive interactions and uncovers multiple promoting effects.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":"31 10","pages":"927-945"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142545770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximate IsoRank for Scalable and Functionally Meaningful Cross-Species Alignments of Protein Interaction Networks.","authors":"Kapil Devkota, Anselm Blumer, Xiaozhe Hu, Lenore Cowen","doi":"10.1089/cmb.2024.0673","DOIUrl":"10.1089/cmb.2024.0673","url":null,"abstract":"<p><p>The IsoRank algorithm of Singh, Xu, and Berger was a pioneering algorithmic advance that applied spectral methods to the problem of cross-species global alignment of biological networks. We develop a new IsoRank approximation that exploits the mathematical properties of IsoRank's linear system to solve the problem in quadratic time with respect to the maximum size of the two protein-protein interaction (PPI) networks. We further propose a refinement to this initial approximation so that the updated result is even closer to the original IsoRank formulation while remaining computationally inexpensive. In experiments on synthetic and real PPI networks with various proposed metrics to measure alignment quality, we find the results of our approximate IsoRank are nearly as accurate as the original IsoRank. In fact, for functional enrichment-based measures of global network alignment quality, our approximation performs better than the exact IsoRank, which is doubtless because it is more robust to the noise of missing or incorrect edges. It also performs competitively against two more recent global network alignment algorithms. 
We also present an analogous approximation to IsoRankN, which extends the network alignment to more than two species.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"990-1007"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142347647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Optimal Metabolic Factories.","authors":"Spencer Krieger, John Kececioglu","doi":"10.1089/cmb.2024.0748","DOIUrl":"10.1089/cmb.2024.0748","url":null,"abstract":"<p><p>Perhaps the most fundamental model in synthetic and systems biology for inferring pathways in metabolic reaction networks is a metabolic <i>factory</i>: a system of reactions that starts from a set of source compounds and produces a set of target molecules, while conserving or not depleting intermediate metabolites. Finding a shortest factory (one that minimizes a sum of real-valued weights on its reactions to infer the most likely pathway) is NP-complete. The current state-of-the-art for shortest factories solves a mixed-integer linear program with a major drawback: it requires the user to set a critical parameter, where too large a value can make optimal solutions infeasible, while too small a value can yield degenerate solutions due to numerical error. We present the first <i>robust algorithm</i> for optimal factories that is both <i>parameter-free</i> (relieving the user from determining a parameter setting) and <i>degeneracy-free</i> (guaranteeing it finds an optimal nondegenerate solution). We also give for the first time a <i>complete characterization</i> of the graph-theoretic structure of shortest factories, that reveals an important class of degenerate solutions which was overlooked and potentially output by the prior state-of-the-art. We show degeneracy is precisely due to <i>invalid stoichiometries</i> in reactions, and provide an efficient algorithm for identifying all such <i>misannotations</i> in a metabolic network. In addition we settle the relationship between the two established pathway models of <i>hyperpaths</i> and factories by proving hyperpaths actually comprise a <i>subclass</i> of factories. 
Comprehensive experiments over all instances from the standard metabolic reaction databases in the literature demonstrate our parameter-free exact algorithm is <i>fast in practice</i>, quickly finding optimal factories in large real-world networks containing thousands of reactions. A preliminary implementation of our robust algorithm for shortest factories in a new tool called Freeia is available free for research use at http://freeia.cs.arizona.edu.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"1045-1086"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142347650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Luca Renders, Lore Depuydt, Sven Rahmann, Jan Fostier
{"title":"Lossless Approximate Pattern Matching: Automated Design of Efficient Search Schemes.","authors":"Luca Renders, Lore Depuydt, Sven Rahmann, Jan Fostier","doi":"10.1089/cmb.2024.0664","DOIUrl":"10.1089/cmb.2024.0664","url":null,"abstract":"<p><p>This study introduces a pioneering approach to automate the creation of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches over a partitioned pattern. Each search specifies the processing order of these parts and the cumulative lower and upper bounds on the number of errors in each part of the pattern. Together, these searches ensure the identification of all approximate occurrences of a search pattern within a predefined limit of <i>k</i> errors. While existing literature offers designed schemes for up to <i>k</i> = 4 errors, designing search schemes for larger <i>k</i> values incurs escalating computational costs. Our method integrates a greedy algorithm and a novel Integer Linear Programming (ILP) formulation to design efficient search schemes for up to <i>k</i> = 7 errors. Comparative analyses demonstrate the superiority of our ILP-optimal schemes over alternative strategies in both theoretical and practical contexts. Additionally, we propose a dynamic scheme selection technique tailored to specific search patterns, further enhancing efficiency. Combined, this yields runtime reductions of up to 53% for higher <i>k</i> values. To facilitate search scheme generation, we present Hato, an open-source software tool (AGPL-3.0 license) employing the greedy algorithm and utilizing CPLEX for ILP solving. Furthermore, we introduce Columba 1.2, an open-source lossless read-mapper (AGPL-3.0 license) implemented in C++. 
Columba surpasses existing state-of-the-art tools by identifying all approximate occurrences of 100,000 Illumina reads (150 bp) in the human reference genome within 24 seconds (maximum edit distance of 4) and 75 seconds (maximum edit distance of 6) using a single CPU core. Notably, our study showcases Columba's capability to align 100,000 reads of length 50, with high error rates and up to an edit distance of 7, in a mere 2 hours and 15 minutes. This achievement is unmatched by other lossless aligners, which require over 3 hours for edit distance 5 alignments. Moreover, Columba exhibits a mapping rate four times higher than that of a lossy tool for this dataset.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"975-989"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142347648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C S Elder, Minh Hoang, Mohsen Ferdosi, Carl Kingsford
{"title":"Approximate and Exact Optimization Algorithms for the Beltway and Turnpike Problems with Duplicated, Missing, Partially Labeled, and Uncertain Measurements.","authors":"C S Elder, Minh Hoang, Mohsen Ferdosi, Carl Kingsford","doi":"10.1089/cmb.2024.0661","DOIUrl":"10.1089/cmb.2024.0661","url":null,"abstract":"<p><p>The Turnpike problem aims to reconstruct a set of one-dimensional points from their unordered pairwise distances. Turnpike arises in biological applications such as molecular structure determination, genomic sequencing, tandem mass spectrometry, and molecular error-correcting codes. Under noisy observation of the distances, the Turnpike problem is NP-hard and can take exponential time and space to solve when using traditional algorithms. To address this, we reframe the noisy Turnpike problem through the lens of optimization, seeking to simultaneously find the unknown point set and a permutation that maximizes similarity to the input distances. Our core contribution is a suite of algorithms that robustly solve this new objective. This includes a bilevel optimization framework that can efficiently solve Turnpike instances with up to 100,000 points. We show that this framework can be extended to scenarios with domain-specific constraints that include duplicated, missing, and partially labeled distances. Using these, we also extend our algorithms to work for points distributed on a circle (the Beltway problem). For small-scale applications that require global optimality, we formulate an integer linear program (ILP) that (i) accepts an objective from a generic family of convex functions and (ii) uses an extended formulation to reduce the number of binary variables. On synthetic and real partial digest data, our bilevel algorithms achieved state-of-the-art scalability across challenging scenarios with performance that matches or exceeds competing baselines. 
On small-scale instances, our ILP efficiently recovered ground-truth assignments and produced reconstructions that match or exceed our alternating algorithms. Our implementations are available at https://github.com/Kingsford-Group/turnpikesolvermm.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"908-926"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11698667/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142466625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Henry Childs, Nathan Guerin, Pei Zhou, Bruce R Donald
{"title":"Protocol for Designing <i>De Novo</i> Noncanonical Peptide Binders in OSPREY.","authors":"Henry Childs, Nathan Guerin, Pei Zhou, Bruce R Donald","doi":"10.1089/cmb.2024.0669","DOIUrl":"10.1089/cmb.2024.0669","url":null,"abstract":"<p><p>D-peptides, the mirror image of canonical L-peptides, offer numerous biological advantages that make them effective therapeutics. This article details how to use DexDesign, the newest OSPREY-based algorithm, for designing these D-peptides <i>de novo</i>. OSPREY physics-based models precisely mimic energy-equivariant reflection operations, enabling the generation of D-peptide scaffolds from L-peptide templates. Due to the scarcity of D-peptide:L-protein structural data, DexDesign calls a geometric hashing algorithm, Method of Accelerated Search for Tertiary Ensemble Representatives, as a subroutine to produce a synthetic structural dataset. DexDesign enables mixed-chirality designs with a new user interface and also reduces the conformation and sequence search space using three new design techniques: Minimum Flexible Set, Inverse Alanine Scanning, and K*-based Mutational Scanning.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"965-974"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11698684/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142371980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alessio Campanelli, Giulio Ermanno Pibiri, Jason Fan, Rob Patro
{"title":"Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs.","authors":"Alessio Campanelli, Giulio Ermanno Pibiri, Jason Fan, Rob Patro","doi":"10.1089/cmb.2024.0714","DOIUrl":"10.1089/cmb.2024.0714","url":null,"abstract":"<p><p>We describe lossless compressed data structures for the <i>colored</i> de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from <i>k</i>-mers to their <i>color sets</i>. The color set of a <i>k</i>-mer is the set of all identifiers, or <i>colors</i>, of the references that contain the <i>k</i>-mer. While these maps find countless applications in computational biology (e.g., basic query, reading mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage on the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). 
Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":"1022-1044"},"PeriodicalIF":1.4,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11631793/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142390934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}