Journal of Computational Biology: Latest Articles, Page 7

Correcting for Observation Bias in Cancer Progression Modeling.

IF 1.4, Zone 4 (Biology)

Journal of Computational Biology Pub Date : 2024-10-01 DOI: 10.1089/cmb.2024.0666

Rudolf Schill, Maren Klever, Andreas Lösch, Y Linda Hu, Stefan Vocht, Kevin Rupp, Lars Grasedyck, Rainer Spang, Niko Beerenwinkel

Abstract: Tumor progression is driven by the accumulation of genetic alterations, including both point mutations and copy number changes. Understanding the temporal sequence of these events is crucial for comprehending the disease but is not directly discernible from cross-sectional genomic data. Cancer progression models, including Mutual Hazard Networks (MHNs), aim to reconstruct the dynamics of tumor progression by learning the causal interactions between genetic events based on their co-occurrence patterns in cross-sectional data. Here, we highlight a commonly overlooked bias in cross-sectional datasets that can distort progression modeling. Tumors become clinically detectable when they cause symptoms or are identified through imaging or tests. Detection factors, such as size, inflammation (fever, fatigue), and elevated biochemical markers, are influenced by genomic alterations. Ignoring these effects leads to "conditioning on a collider" bias, where events making the tumor more observable appear anticorrelated, creating false suppressive effects or masking promoting effects among genetic events. We enhance MHNs by incorporating the effects of genetic progression events on the inclusion of a tumor in a dataset, thus correcting for collider bias. We derive an efficient tensor formula for the likelihood function and apply it to two datasets from the MSK-IMPACT study. In colon adenocarcinoma, we observe a significantly higher rate of clinical detection for TP53-positive tumors, while in lung adenocarcinoma, the same is true for EGFR-positive tumors. Compared to classical MHNs, this approach eliminates several spurious suppressive interactions and uncovers multiple promoting effects.

Volume 31, issue 10, pages 927-945. Publication type: Journal Article.
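
The collider effect described in this abstract can be illustrated with a toy calculation. The sketch below is not the MHN model or the authors' likelihood; it assumes two independent binary events A and B and a hypothetical table of detection probabilities (`DETECT`, chosen only for illustration), and computes exactly how conditioning on detection changes their covariance.

```python
from itertools import product


def covariance_among_detected(p_a, p_b, detect_prob):
    """Covariance of two independent binary events among detected tumors.

    p_a, p_b: marginal probabilities of events A and B (independent a priori).
    detect_prob: maps each state (a, b) to the probability that a tumor in
    that state is detected and therefore included in the dataset.
    """
    weights = {}
    for a, b in product((0, 1), repeat=2):
        prior = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
        weights[(a, b)] = prior * detect_prob[(a, b)]
    total = sum(weights.values())
    p_a_detected = (weights[(1, 0)] + weights[(1, 1)]) / total
    p_b_detected = (weights[(0, 1)] + weights[(1, 1)]) / total
    p_ab_detected = weights[(1, 1)] / total
    return p_ab_detected - p_a_detected * p_b_detected


# Hypothetical detection probabilities: each event makes detection more likely.
DETECT = {(0, 0): 0.05, (1, 0): 0.5, (0, 1): 0.5, (1, 1): 0.95}
UNIFORM = {state: 1.0 for state in DETECT}

biased = covariance_among_detected(0.3, 0.3, DETECT)
unbiased = covariance_among_detected(0.3, 0.3, UNIFORM)
print(f"covariance with state-dependent detection: {biased:.4f}")
print(f"covariance with uniform detection: {unbiased:.4f}")
```

With uniform detection the covariance stays at zero, while state-dependent detection makes the two independent events look anticorrelated, which is the kind of spurious suppressive signal the abstract refers to.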

Citations: 0

Approximate IsoRank for Scalable and Functionally Meaningful Cross-Species Alignments of Protein Interaction Networks.

IF 1.4, Zone 4 (Biology)

Journal of Computational Biology Pub Date : 2024-10-01 Epub Date: 2024-09-24 DOI: 10.1089/cmb.2024.0673

Kapil Devkota, Anselm Blumer, Xiaozhe Hu, Lenore Cowen

Abstract: The IsoRank algorithm of Singh, Xu, and Berger was a pioneering algorithmic advance that applied spectral methods to the problem of cross-species global alignment of biological networks. We develop a new IsoRank approximation that exploits the mathematical properties of IsoRank's linear system to solve the problem in quadratic time with respect to the maximum size of the two protein-protein interaction (PPI) networks. We further propose a refinement to this initial approximation so that the updated result is even closer to the original IsoRank formulation while remaining computationally inexpensive. In experiments on synthetic and real PPI networks with various proposed metrics to measure alignment quality, we find the results of our approximate IsoRank are nearly as accurate as the original IsoRank. In fact, for functional enrichment-based measures of global network alignment quality, our approximation performs better than the exact IsoRank, which is doubtless because it is more robust to the noise of missing or incorrect edges. It also performs competitively against two more recent global network alignment algorithms. We also present an analogous approximation to IsoRankN, which extends the network alignment to more than two species.

Pages 990-1007. Publication type: Journal Article.
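
For orientation, the original IsoRank recurrence that this abstract builds on can be sketched as a plain fixed-point iteration. This is not the quadratic-time approximation from the paper: it is the textbook form R = alpha * A R + (1 - alpha) * E, with a uniform prior E standing in for sequence similarity (an assumption), and the function name and toy graphs are hypothetical.

```python
def isorank_scores(adj1, adj2, alpha=0.8, tol=1e-12, max_iter=10_000):
    """Iterate the IsoRank recurrence on two undirected graphs.

    adj1, adj2: dict mapping each node to a list of its neighbours
    (no isolated nodes). Returns a dict mapping node pairs (i, j) to scores.
    """
    pairs = [(i, j) for i in adj1 for j in adj2]
    # Uniform prior used here in place of sequence-similarity scores.
    prior = {pair: 1.0 / len(pairs) for pair in pairs}
    scores = dict(prior)
    for _ in range(max_iter):
        updated = {}
        for i, j in pairs:
            # A pair is supported by the scores of its neighbouring pairs,
            # each shared out equally among that pair's own neighbours.
            support = sum(
                scores[(u, v)] / (len(adj1[u]) * len(adj2[v]))
                for u in adj1[i]
                for v in adj2[j]
            )
            updated[(i, j)] = alpha * support + (1 - alpha) * prior[(i, j)]
        change = max(abs(updated[p] - scores[p]) for p in pairs)
        scores = updated
        if change < tol:
            break
    return scores


# Toy example: aligning a three-node path with a copy of itself.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = isorank_scores(path, path)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 4))
```

Each update touches every pair of neighbour pairs, which is the cost that motivates looking for a cheaper approximation on large PPI networks.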

Citations: 0

RECOMB 2024 Special Issue.

IF 1.4, Zone 4 (Biology)

Journal of Computational Biology Pub Date : 2024-10-01 Epub Date: 2024-09-20 DOI: 10.1089/cmb.2024.0809

Jian Ma, Mona Singh

Citations: 0

Robust Optimal Metabolic Factories.

IF 1.4, Zone 4 (Biology)

Journal of Computational Biology Pub Date : 2024-10-01 Epub Date: 2024-09-27 DOI: 10.1089/cmb.2024.0748

Spencer Krieger, John Kececioglu

Abstract: Perhaps the most fundamental model in synthetic and systems biology for inferring pathways in metabolic reaction networks is a metabolic factory: a system of reactions that starts from a set of source compounds and produces a set of target molecules, while conserving or not depleting intermediate metabolites. Finding a shortest factory (one that minimizes a sum of real-valued weights on its reactions, to infer the most likely pathway) is NP-complete. The current state-of-the-art for shortest factories solves a mixed-integer linear program with a major drawback: it requires the user to set a critical parameter, where too large a value can make optimal solutions infeasible, while too small a value can yield degenerate solutions due to numerical error. We present the first robust algorithm for optimal factories that is both parameter-free (relieving the user from determining a parameter setting) and degeneracy-free (guaranteeing it finds an optimal nondegenerate solution). We also give for the first time a complete characterization of the graph-theoretic structure of shortest factories, which reveals an important class of degenerate solutions that was overlooked and potentially output by the prior state-of-the-art. We show degeneracy is precisely due to invalid stoichiometries in reactions, and provide an efficient algorithm for identifying all such misannotations in a metabolic network. In addition, we settle the relationship between the two established pathway models of hyperpaths and factories by proving hyperpaths actually comprise a subclass of factories. Comprehensive experiments over all instances from the standard metabolic reaction databases in the literature demonstrate our parameter-free exact algorithm is fast in practice, quickly finding optimal factories in large real-world networks containing thousands of reactions. A preliminary implementation of our robust algorithm for shortest factories in a new tool called Freeia is available free for research use at http://freeia.cs.arizona.edu.

Pages 1045-1086. Publication type: Journal Article.

Citations: 0

Lossless Approximate Pattern Matching: Automated Design of Efficient Search Schemes.

IF 1.4, Zone 4 (Biology)

Journal of Computational Biology Pub Date : 2024-10-01 Epub Date: 2024-09-30 DOI: 10.1089/cmb.2024.0664

Luca Renders, Lore Depuydt, Sven Rahmann, Jan Fostier

Abstract: This study introduces a pioneering approach to automate the creation of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches over a partitioned pattern. Each search specifies the processing order of these parts and the cumulative lower and upper bounds on the number of errors in each part of the pattern. Together, these searches ensure the identification of all approximate occurrences of a search pattern within a predefined limit of k errors. While existing literature offers designed schemes for up to k = 4 errors, designing search schemes for larger k values incurs escalating computational costs. Our method integrates a greedy algorithm and a novel Integer Linear Programming (ILP) formulation to design efficient search schemes for up to k = 7 errors. Comparative analyses demonstrate the superiority of our ILP-optimal schemes over alternative strategies in both theoretical and practical contexts. Additionally, we propose a dynamic scheme selection technique tailored to specific search patterns, further enhancing efficiency. Combined, this yields runtime reductions of up to 53% for higher k values. To facilitate search scheme generation, we present Hato, an open-source software tool (AGPL-3.0 license) employing the greedy algorithm and utilizing CPLEX for ILP solving. Furthermore, we introduce Columba 1.2, an open-source lossless read-mapper (AGPL-3.0 license) implemented in C++. Columba surpasses existing state-of-the-art tools by identifying all approximate occurrences of 100,000 Illumina reads (150 bp) in the human reference genome within 24 seconds (maximum edit distance of 4) and 75 seconds (maximum edit distance of 6) using a single CPU core. Notably, our study showcases Columba's capability to align 100,000 reads of length 50, with high error rates and up to an edit distance of 7, in a mere 2 hours and 15 minutes. This achievement is unmatched by other lossless aligners, which require over 3 hours for edit distance 5 alignments. Moreover, Columba exhibits a mapping rate four times higher than that of a lossy tool for this dataset.

Pages 975-989. Publication type: Journal Article.
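
The definition of a search scheme given in this abstract (a processing order plus cumulative lower and upper error bounds per step) lends itself to a small brute-force losslessness check. The sketch below is only an illustration under assumed conventions: each search is a tuple `(order, lower, upper)`, parts are numbered from 0, and `SCHEME_K1` is a small hand-written two-part scheme for k = 1, not one generated by Hato.

```python
from itertools import product


def covers(search, errors_per_part):
    """True if the search accepts this distribution of errors over the parts."""
    order, lower, upper = search
    cumulative = 0
    for step, part in enumerate(order):
        cumulative += errors_per_part[part]
        # Bounds are cumulative: they apply to all parts processed so far.
        if not lower[step] <= cumulative <= upper[step]:
            return False
    return True


def is_lossless(scheme, num_parts, k):
    """Check that every error distribution with at most k errors is covered."""
    for errors in product(range(k + 1), repeat=num_parts):
        if sum(errors) <= k and not any(covers(s, errors) for s in scheme):
            return False
    return True


# Hypothetical two-part scheme for k = 1 (illustration only).
SCHEME_K1 = [
    ((0, 1), (0, 0), (0, 1)),  # part 0 exact first, then up to 1 error overall
    ((1, 0), (0, 1), (0, 1)),  # part 1 exact first, then exactly 1 error overall
]

print(is_lossless(SCHEME_K1, num_parts=2, k=1))
print(is_lossless(SCHEME_K1[:1], num_parts=2, k=1))
```

Enumerating all error distributions grows quickly with k and the number of parts, which hints at why designing schemes for larger k is described as computationally costly.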

Citations: 0

Approximate and Exact Optimization Algorithms for the Beltway and Turnpike Problems with Duplicated, Missing, Partially Labeled, and Uncertain Measurements.

IF 1.4, Zone 4 (Biology)

Journal of Computational Biology Pub Date : 2024-10-01 Epub Date: 2024-10-10 DOI: 10.1089/cmb.2024.0661

C S Elder, Minh Hoang, Mohsen Ferdosi, Carl Kingsford

Abstract: The Turnpike problem aims to reconstruct a set of one-dimensional points from their unordered pairwise distances. Turnpike arises in biological applications such as molecular structure determination, genomic sequencing, tandem mass spectrometry, and molecular error-correcting codes. Under noisy observation of the distances, the Turnpike problem is NP-hard and can take exponential time and space to solve when using traditional algorithms. To address this, we reframe the noisy Turnpike problem through the lens of optimization, seeking to simultaneously find the unknown point set and a permutation that maximizes similarity to the input distances. Our core contribution is a suite of algorithms that robustly solve this new objective. This includes a bilevel optimization framework that can efficiently solve Turnpike instances with up to 100,000 points. We show that this framework can be extended to scenarios with domain-specific constraints that include duplicated, missing, and partially labeled distances. Using these, we also extend our algorithms to work for points distributed on a circle (the Beltway problem). For small-scale applications that require global optimality, we formulate an integer linear program (ILP) that (i) accepts an objective from a generic family of convex functions and (ii) uses an extended formulation to reduce the number of binary variables. On synthetic and real partial digest data, our bilevel algorithms achieved state-of-the-art scalability across challenging scenarios with performance that matches or exceeds competing baselines. On small-scale instances, our ILP efficiently recovered ground-truth assignments and produced reconstructions that match or exceed our alternating algorithms. Our implementations are available at https://github.com/Kingsford-Group/turnpikesolvermm.

Pages 908-926. Publication type: Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11698667/pdf/
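
To make the problem statement concrete, here is a minimal sketch of the noiseless Turnpike problem using the classic backtracking idea (repeatedly place the largest remaining distance at one end or the other). It is one of the "traditional algorithms" alluded to, not the bilevel or ILP methods of the paper, and the function names are hypothetical. A solution is only determined up to mirror reflection, so the recovered points may be the mirror image of the original set.

```python
from collections import Counter


def pairwise_distances(points):
    """Sorted multiset of distances between all pairs of points on a line."""
    pts = sorted(points)
    return sorted(b - a for i, a in enumerate(pts) for b in pts[i + 1:])


def solve_turnpike(distances):
    """Reconstruct points (starting at 0) from exact pairwise distances.

    Returns a sorted list of points, or None if no point set matches.
    """
    remaining = Counter(distances)
    width = max(remaining)
    remaining[width] -= 1
    placed = [0, width]

    def backtrack():
        if sum(remaining.values()) == 0:
            return True
        largest = max(d for d, count in remaining.items() if count > 0)
        # The largest unexplained distance must involve one of the endpoints.
        for candidate in (largest, width - largest):
            needed = Counter(abs(candidate - p) for p in placed)
            if any(remaining[d] < count for d, count in needed.items()):
                continue
            remaining.subtract(needed)
            placed.append(candidate)
            if backtrack():
                return True
            placed.pop()
            remaining.update(needed)
        return False

    return sorted(placed) if backtrack() else None


original = [0, 2, 4, 7, 10]
recovered = solve_turnpike(pairwise_distances(original))
print(recovered)
```

The exact-match test inside `backtrack` is what breaks down under noisy distances, which is the setting the abstract recasts as an optimization problem.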

Citations: 0

Protocol for Designing De Novo Noncanonical Peptide Binders in OSPREY.

IF 1.4, Zone 4 (Biology)

Journal of Computational Biology Pub Date : 2024-10-01 Epub Date: 2024-10-04 DOI: 10.1089/cmb.2024.0669

Henry Childs, Nathan Guerin, Pei Zhou, Bruce R Donald

Citations: 0

Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs.

IF 1.4, Zone 4 (Biology)

Journal of Computational Biology Pub Date : 2024-10-01 Epub Date: 2024-10-09 DOI: 10.1089/cmb.2024.0714

Alessio Campanelli, Giulio Ermanno Pibiri, Jason Fan, Rob Patro

Abstract: We describe lossless compressed data structures for the colored de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer. While these maps find countless applications in computational biology (e.g., basic query, read mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes.

Pages 1022-1044. Publication type: Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11631793/pdf/
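
The view of a c-dBG as a map from k-mers to color sets can be sketched in a few lines. The example below is a naive in-memory illustration, not the compressed data structures of the paper: it builds the map with plain dictionaries and then stores each distinct color set only once, which is the simplest form of exploiting repetition. All names and the toy sequences are hypothetical.

```python
def build_color_map(references, k):
    """Map each k-mer to the set of reference indices (colors) containing it."""
    kmer_colors = {}
    for color, sequence in enumerate(references):
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            kmer_colors.setdefault(kmer, set()).add(color)
    return kmer_colors


def deduplicate_color_sets(kmer_colors):
    """Store each distinct color set once and point k-mers to it by id."""
    set_ids = {}
    kmer_to_set_id = {}
    for kmer, colors in kmer_colors.items():
        key = frozenset(colors)
        # A new color set receives the next free id; a repeated one is reused.
        kmer_to_set_id[kmer] = set_ids.setdefault(key, len(set_ids))
    # Insertion order of the dict matches the assigned ids.
    return kmer_to_set_id, list(set_ids)


references = ["ACGTAC", "ACGTTT", "GGGACG"]
kmer_colors = build_color_map(references, k=3)
kmer_to_set_id, color_sets = deduplicate_color_sets(kmer_colors)
print(len(kmer_colors), "k-mers share", len(color_sets), "distinct color sets")
```

The abstract goes further than this whole-set deduplication by factorizing color sets into patterns that repeat across the collection, so partially overlapping sets can share storage as well.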

Citations: 0

Fast Context-Aware Analysis of Genome Annotation Colocalization.

IF 1.4, Zone 4 (Biology)

Journal of Computational Biology Pub Date : 2024-10-01 Epub Date: 2024-10-09 DOI: 10.1089/cmb.2024.0667

Askar Gafurov, Tomáš Vinař, Paul Medvedev, Broňa Brejová

Abstract: An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes or their exons, sequence repeats, regions with a particular epigenetic state, and copy number variants. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing random unrelated annotations. To incorporate more background information into such analyses, we propose a new null model based on a Markov chain that differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or assembly gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistic and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models. We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. Moreover, the use of genomic contexts to correct for GC bias resulted in the reversal of some previously published findings.

Pages 946-964. Publication type: Journal Article. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11698669/pdf/
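
The last step mentioned in this abstract, turning an exact expectation and variance into a p-value through a normal approximation, is simple to sketch. The function below assumes the expectation and variance under the null model have already been computed (that computation is the paper's contribution and is not reproduced here); the function name and the `tail` parameter are hypothetical.

```python
import math


def normal_approx_p_value(observed, expectation, variance, tail="upper"):
    """One-sided p-value of a test statistic under a normal approximation.

    tail="upper" tests for enrichment (statistic larger than expected),
    tail="lower" tests for depletion (statistic smaller than expected).
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = (observed - expectation) / math.sqrt(variance)
    if tail == "upper":
        # P(Z >= z) for a standard normal Z, via the complementary error function.
        return 0.5 * math.erfc(z / math.sqrt(2))
    if tail == "lower":
        # P(Z <= z), written with erfc to keep precision for small p-values.
        return 0.5 * math.erfc(-z / math.sqrt(2))
    raise ValueError("tail must be 'upper' or 'lower'")


# Hypothetical numbers: observed overlap count vs. null expectation and variance.
print(normal_approx_p_value(observed=130.0, expectation=100.0, variance=225.0))
```

Using `math.erfc` rather than `1 - cdf` avoids losing precision when the p-value is very small, which matters when many annotation pairs are tested.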

Citations: 0

Imputing Metagenomic Hi-C Contacts Facilitates the Integrative Contig Binning Through Constrained Random Walk with Restart.

IF 1.4, Zone 4 (Biology)

Journal of Computational Biology Pub Date : 2024-10-01 Epub Date: 2024-09-09 DOI: 10.1089/cmb.2024.0663

Yuxuan Du, Wenxuan Zuo, Fengzhu Sun

Abstract: Metagenomic Hi-C (metaHi-C) has shown remarkable potential for retrieving high-quality metagenome-assembled genomes from complex microbial communities. Nevertheless, existing metaHi-C-based contig binning methods solely rely on Hi-C interactions between contigs, disregarding crucial biological information such as the presence of single-copy marker genes. To overcome this limitation, we introduce ImputeCC, an integrative contig binning tool optimized for metaHi-C datasets. ImputeCC integrates both Hi-C interactions and the discriminative power of single-copy marker genes to group marker-gene-containing contigs into preliminary bins. It also introduces a novel constrained random walk with restart algorithm to enhance Hi-C connectivity among contigs. Comprehensive assessments using both mock and real metaHi-C datasets from diverse environments demonstrate that ImputeCC consistently outperforms other Hi-C-based contig binning tools. A genus-level analysis of the sheep gut microbiota reconstructed by ImputeCC underlines its capability to recover key species from dominant genera and identify previously unknown genera.

Pages 1008-1021. Publication type: Journal Article.
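
As background for the "random walk with restart" named in this abstract, the sketch below implements the standard, unconstrained version on a small contact graph. It does not reproduce ImputeCC's constrained variant; the function name, the restart probability, and the toy adjacency matrix are all assumptions made for illustration.

```python
def random_walk_with_restart(adjacency, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walk that restarts at `seed`.

    adjacency: square list of lists of non-negative edge weights
    (for example, contact counts between contigs).
    """
    n = len(adjacency)
    # Column sums are used to turn weights into transition probabilities.
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    restart_vec = [1.0 if i == seed else 0.0 for i in range(n)]
    probs = restart_vec[:]
    for _ in range(max_iter):
        updated = []
        for i in range(n):
            walk = sum(
                adjacency[i][j] / col_sums[j] * probs[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            updated.append((1 - restart) * walk + restart * restart_vec[i])
        converged = max(abs(a - b) for a, b in zip(updated, probs)) < tol
        probs = updated
        if converged:
            break
    return probs


# Toy contact graph: a path 0 - 1 - 2 - 3 with unit weights.
contacts = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print([round(p, 4) for p in random_walk_with_restart(contacts, seed=0)])
```

Nodes 2 and 3 have no direct contact with the seed yet still receive nonzero probability, which is how such a walk can suggest contacts that were not observed directly.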

Citations: 0