Bioinformatics (Oxford, England): Latest Articles

Predicting Explainable Dementia Types with LLM-aided Feature Engineering.
Bioinformatics (Oxford, England) Pub Date : 2025-04-08 DOI: 10.1093/bioinformatics/btaf156
Aditya M Kashyap, Delip Rao, Mary Regina Boland, Li Shen, Chris Callison-Burch
Motivation: The integration of Machine Learning (ML) and Artificial Intelligence (AI) into healthcare has immense potential due to the rapidly growing volume of clinical data. However, existing AI models, particularly Large Language Models (LLMs) like GPT-4, face significant challenges in terms of explainability and reliability, particularly in high-stakes domains like healthcare.
Results: This paper proposes a novel LLM-aided feature engineering approach that enhances interpretability by extracting clinically relevant features from the Oxford Textbook of Medicine. By converting clinical notes into concept vector representations and employing a linear classifier, our method achieved an accuracy of 0.72, outperforming a traditional n-gram Logistic Regression baseline (0.64) and the GPT-4 baseline (0.48), while focusing on high level clinical features. We also explore using Text Embeddings to reduce the overall time and cost of our approach by 97%.
Availability: All code relevant to this paper is available at: https://github.com/AdityaKashyap423/Dementia_LLM_Feature_Engineering/tree/main.
Supplementary information: Supplementary PDF and other data files can be found at https://drive.google.com/drive/folders/1UqdpsKFnvGjUJgp58k3RYcJ8zN8zPmWR?usp=share_link.
Citations: 0
Gradient matching accelerates mixed-effects inference for biochemical networks.
Bioinformatics (Oxford, England) Pub Date : 2025-04-08 DOI: 10.1093/bioinformatics/btaf154
Yulan B van Oppen, Andreas Milias-Argeitis
Motivation: Single-cell time series data often exhibit significant variability within an isogenic cell population. When modeling intracellular processes, it is therefore more appropriate to infer parameter distributions that reflect this variability, rather than fitting the population average to obtain a single point estimate. The Global Two-Stage (GTS) approach for nonlinear mixed-effects (NLME) models is a simple and modular method commonly used for this purpose. However, this method is computationally intensive due to its repeated use of non-convex optimization and numerical integration of the underlying system.
Results: We propose the Gradient Matching GTS (GMGTS) method as an efficient alternative to GTS. Gradient matching offers an integration-free approach to parameter estimation that is particularly powerful for systems that are linear in the unknown parameters, such as biochemical networks modeled by mass action kinetics. By incorporating gradient matching into the GTS framework, we expand its capabilities through uncertainty propagation calculations and an iterative estimation scheme for partially observed systems. Comparisons between GMGTS and GTS across various inference setups show that our method significantly reduces computational demands, facilitating the application of complex NLME models in systems biology.
Availability and implementation: A Matlab implementation of GMGTS is provided at https://github.com/yulanvanoppen/GMGTS (DOI: http://doi.org/10.5281/zenodo.14884457).
Supplementary information: Supplemental Information is available online and contains Tables S1-S4, Figures S1-S21, methodology, mathematical derivations, and software implementation details.
Citations: 0
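The idea of gradient matching mentioned in the abstract above can be illustrated with a toy example. The sketch below is not the GMGTS method (it has no mixed effects, no smoothing, and no uncertainty propagation); it only shows why a model that is linear in its parameters can be fitted without numerical integration. The single-reaction decay model dx/dt = -k x, the noise-free data, and all function names are assumptions made for illustration.

```python
import math

def simulate_decay(x0, k, step, n_points):
    """Exact solution of dx/dt = -k * x, sampled on a regular time grid."""
    return [x0 * math.exp(-k * i * step) for i in range(n_points)]

def central_differences(values, step):
    """Approximate dx/dt at the interior grid points."""
    return [(values[i + 1] - values[i - 1]) / (2 * step)
            for i in range(1, len(values) - 1)]

def estimate_rate(values, step):
    """Least-squares fit of k in dx/dt = -k * x, with no ODE integration.

    The right-hand side is linear in k, so matching it to the estimated
    slopes is an ordinary linear regression with regressor g = -x.
    """
    slopes = central_differences(values, step)
    regressors = [-x for x in values[1:-1]]
    numerator = sum(g * s for g, s in zip(regressors, slopes))
    denominator = sum(g * g for g in regressors)
    return numerator / denominator

trajectory = simulate_decay(x0=2.0, k=0.5, step=0.01, n_points=501)
k_hat = estimate_rate(trajectory, step=0.01)
print(f"estimated k = {k_hat:.6f}")
```

With noisy measurements the finite differences would have to be replaced by derivatives of a smoothed fit, which is where practical gradient matching methods spend most of their effort.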
Rethinking GWAS: how lessons from genetic screens and artificial intelligence could reveal biological mechanisms.
Bioinformatics (Oxford, England) Pub Date : 2025-04-08 DOI: 10.1093/bioinformatics/btaf153
Dennis J Hazelett
Motivation: Modern single-cell omics data are key to unraveling the complex mechanisms underlying risk for complex diseases revealed by genome-wide association studies (GWAS). Phenotypic screens in model organisms have several important parallels to GWAS which I explore in this essay.
Results: I provide the historical context of such screens, comparing and contrasting similarities to association studies, and how these screens in model organisms can teach us what to look for. Then I consider how the results of GWAS might be exhaustively interrogated to interpret the biological mechanisms underpinning disease processes. Finally, I propose a general framework for tackling this problem computationally, and explore the data, mechanisms, and technology (both existing and yet to be invented) that are necessary to complete the task.
Availability and implementation: There are no data or code associated with this article.
Supplementary information: Not applicable.
Citations: 0
ROICellTrack: A deep learning framework for integrating cellular imaging modalities in subcellular spatial transcriptomic profiling of tumor tissues.
Bioinformatics (Oxford, England) Pub Date : 2025-04-08 DOI: 10.1093/bioinformatics/btaf152
Xiaofei Song, Xiaoqing Yu, Carlos M Moran-Segura, Hongzhi Xu, Tingyi Li, Joshua T Davis, Aram Vosoughi, G Daniel Grass, Roger Li, Xuefeng Wang
Motivation: Spatial transcriptomics (ST) technologies, such as GeoMx Digital Spatial Profiler, are increasingly utilized to investigate the role of diverse tumor microenvironment components, particularly in relation to cancer progression, treatment response, and therapeutic resistance. However, in many ST studies, the spatial information obtained from immunofluorescence imaging is primarily used for identifying regions of interest rather than as an integral part of downstream transcriptomic data analysis and interpretation.
Results: We developed ROICellTrack, a deep learning-based framework that better integrates cellular imaging with spatial transcriptomic profiling. By analyzing 56 ROIs from urothelial carcinoma of the bladder (UCB) and upper tract urothelial carcinoma (UTUC), ROICellTrack identified distinct cancer-immune cell mixtures, characterized by specific transcriptomic and morphological signatures and receptor-ligand interactions linked to tumor content and immune infiltrations. Our findings demonstrate the value of integrating imaging with transcriptomics to analyze spatial omics data, improving our understanding of tumor heterogeneity and its relevance to personalized and targeted therapies.
Availability: ROICellTrack is publicly available at https://github.com/wanglab1/ROICellTrack.
Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
Realfreq: Real-time base modification analysis for nanopore sequencing.
Bioinformatics (Oxford, England) Pub Date : 2025-04-07 DOI: 10.1093/bioinformatics/btaf151
Suneth Samarasinghe, Ira Deveson, Hasindu Gamaarachchi
Summary: Nanopore sequencers allow sequencing data to be accessed in real-time. This allows live analysis to be performed while the sequencing is running, reducing the turnaround time of the results. We introduce realfreq, a framework for obtaining real-time base modification frequencies while a nanopore sequencer is in operation. Realfreq calculates and allows access to the real-time base modification frequency results while the sequencer is running. We demonstrate that the data analysis rate with realfreq on a laptop computer can keep up with the output data rate of a nanopore MinION sequencer, while a desktop computer can keep up with a single PromethION 2 solo flowcell.
Availability and implementation: Realfreq is a free and open-source application implemented in the C programming language and shell scripts. The source code and the documentation for realfreq can be found at https://github.com/imsuneth/realfreq. The version used for the manuscript is also available at 10.5281/zenodo.15128668.
Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
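The abstract above describes modification frequencies that stay queryable while new reads keep arriving. A minimal sketch of that bookkeeping follows. It is not realfreq's implementation (which is written in C); the class name, the `(contig, position, probability)` tuple layout, and the 0.8 probability threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict

class ModFreqTracker:
    """Running per-site modification frequency, updated batch by batch."""

    def __init__(self, threshold=0.8):
        # Hypothetical cutoff: a call counts as modified at or above it.
        self.threshold = threshold
        # (contig, position) -> [modified calls, total calls]
        self.counts = defaultdict(lambda: [0, 0])

    def update(self, calls):
        """Fold in a new batch of (contig, position, probability) calls."""
        for contig, position, probability in calls:
            entry = self.counts[(contig, position)]
            entry[1] += 1
            if probability >= self.threshold:
                entry[0] += 1

    def frequency(self, contig, position):
        """Current frequency at a site, or None if nothing was called there."""
        modified, total = self.counts.get((contig, position), (0, 0))
        return modified / total if total else None

tracker = ModFreqTracker(threshold=0.8)
tracker.update([("chr1", 100, 0.95), ("chr1", 100, 0.10)])
print(tracker.frequency("chr1", 100))  # two calls so far, one modified
tracker.update([("chr1", 100, 0.90)])
print(tracker.frequency("chr1", 100))  # three calls, two modified
```

Because only two integers are kept per site, each batch is folded in with work proportional to its own size, so a query never has to reprocess earlier reads.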
CoverM: Read alignment statistics for metagenomics.
Bioinformatics (Oxford, England) Pub Date : 2025-04-07 DOI: 10.1093/bioinformatics/btaf147
Samuel T N Aroney, Rhys J P Newell, Jakob N Nissen, Antonio Pedro Camargo, Gene W Tyson, Ben J Woodcroft
Summary: Genome-centric analysis of metagenomic samples is a powerful method for understanding the function of microbial communities. Calculating read coverage is a central part of analysis, enabling differential coverage binning for recovery of genomes and estimation of microbial community composition. Coverage is determined by processing read alignments to reference sequences of either contigs or genomes. Per-reference coverage is typically calculated in an ad-hoc manner, with each software package providing its own implementation and specific definition of coverage. Here we present a unified software package CoverM which calculates several coverage statistics for contigs and genomes in an ergonomic and flexible manner. It uses 'Mosdepth arrays' for computational efficiency and avoids unnecessary I/O overhead by calculating coverage statistics from streamed read alignment results.
Availability and implementation: CoverM is free software available at https://github.com/wwood/coverm. CoverM is implemented in Rust, with Python (https://github.com/apcamargo/pycoverm) and Julia (https://github.com/JuliaBinaryWrappers/CoverM_jll.jl) interfaces.
Citations: 0
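To make the notion of per-reference coverage concrete, here is a small sketch of the difference-array technique popularised by mosdepth: each alignment adds +1 at its start and -1 at its end, and a running sum recovers the depth at every base. This is an illustration only, not CoverM's code or API; the function names and the 0-based half-open `(start, end)` interval convention are assumptions.

```python
def per_base_depth(reference_length, alignments):
    """Depth at each base from (start, end) intervals, 0-based half-open."""
    delta = [0] * (reference_length + 1)
    for start, end in alignments:
        delta[start] += 1   # coverage begins at start
        delta[end] -= 1     # coverage stops just before end
    depth = []
    running = 0
    for i in range(reference_length):
        running += delta[i]
        depth.append(running)
    return depth

def mean_coverage(depth):
    """Average depth over all bases of the reference."""
    return sum(depth) / len(depth)

def covered_fraction(depth):
    """Fraction of bases covered by at least one alignment."""
    return sum(1 for d in depth if d > 0) / len(depth)

depth = per_base_depth(10, [(0, 5), (3, 8)])
print(depth)                    # [1, 1, 1, 2, 2, 1, 1, 1, 0, 0]
print(mean_coverage(depth))     # 1.0
print(covered_fraction(depth))  # 0.8
```

Each alignment costs two array updates regardless of its length, which is why this layout suits streamed alignment records.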
Topology-Driven Negative Sampling Enhances Generalizability in Protein-Protein Interaction Prediction.
Bioinformatics (Oxford, England) Pub Date : 2025-04-07 DOI: 10.1093/bioinformatics/btaf148
Ayan Chatterjee, Babak Ravandi, Parham Haddadi, Naomi H Philip, Mario Abdelmessih, William R Mowrey, Piero Ricchiuto, Yupu Liang, Wei Ding, Juan C Mobarec, Tina Eliassi-Rad
Motivation: Unraveling the human interactome to uncover disease-specific patterns and discover drug targets hinges on accurate protein-protein interaction (PPI) predictions. However, challenges persist in machine learning (ML) models due to a scarcity of quality hard negative samples, shortcut learning, and limited generalizability to novel proteins.
Results: In this study, we introduce a novel approach for strategic sampling of protein-protein non-interactions (PPNIs) by leveraging higher-order network characteristics that capture the inherent complementarity-driven mechanisms of PPIs. Next, we introduce UPNA-PPI (Unsupervised Pre-training of Node Attributes tuned for PPI), a high throughput sequence-to-function ML pipeline, integrating unsupervised pre-training in protein representation learning with Topological PPNI (TPPNI) samples, capable of efficiently screening billions of interactions. By using our TPPNI in training the UPNA-PPI model, we improve PPI prediction generalizability and interpretability, particularly in identifying potential binding sites locations on amino acid sequences, strengthening the prioritization of screening assays and facilitating the transferability of ML predictions across protein families and homodimers. UPNA-PPI establishes the foundation for a fundamental negative sampling methodology in graph machine learning by integrating insights from network topology.
Availability and implementation: Code and UPNA-PPI predictions are freely available at https://github.com/alxndgb/UPNA-PPI.
Supplementary information: Supplementary data are available at Bioinformatics online.
Citations: 0
miss-SNF: a multimodal patient similarity network integration approach to handle completely missing data sources.
Bioinformatics (Oxford, England) Pub Date : 2025-04-04 DOI: 10.1093/bioinformatics/btaf150
Jessica Gliozzo, Mauricio A Soto Gomez, Arturo Bonometti, Alex Patak, Elena Casiraghi, Giorgio Valentini
Motivation: Precision medicine leverages patient-specific multimodal data to improve prevention, diagnosis, prognosis and treatment of diseases. Advancing precision medicine requires the non-trivial integration of complex, heterogeneous and potentially high-dimensional data sources, such as multi-omics and clinical data. In the literature several approaches have been proposed to manage missing data, but they are usually limited to the recovery of subsets of features for a subset of patients. A largely overlooked problem is the integration of multiple sources of data when one or more of them are completely missing for a subset of patients, a relatively common condition in clinical practice.
Results: We propose miss-Similarity Network Fusion (miss-SNF), a novel general-purpose data integration approach designed to manage completely missing data in the context of patient similarity networks. Miss-SNF integrates incomplete unimodal patient similarity networks by leveraging a non-linear message-passing strategy borrowed from the SNF algorithm. Miss-SNF is able to recover missing patient similarities and is "task agnostic", in the sense that it can integrate partial data for both unsupervised and supervised prediction tasks. Experimental analyses on nine cancer datasets from The Cancer Genome Atlas (TCGA) demonstrate that miss-SNF achieves state-of-the-art results in recovering similarities and in identifying patient subgroups enriched in clinically relevant variables and having differential survival. Moreover, amputation experiments show that miss-SNF supervised prediction of cancer clinical outcomes and Alzheimer's disease diagnosis with completely missing data achieves results comparable to those obtained when all the data are available.
Availability and implementation: miss-SNF code, implemented in R, is available at https://github.com/AnacletoLAB/missSNF.
Supplementary information: Supplementary information is available at Bioinformatics online.
Citations: 0
AlertGS: Determining alerts for gene sets.
Bioinformatics (Oxford, England) Pub Date : 2025-04-03 DOI: 10.1093/bioinformatics/btaf133
Franziska Kappenberg, Jörg Rahnenführer
Motivation: A typical goal in gene expression studies is identifying certain gene sets enriched with significant genes. The measurement of many gene expression experiments for several concentrations or time points allows the modeling of the concentration/time-response relationship for each gene, and the subsequent estimation of a gene-wise alert. In this work, an approach is proposed to transfer the concept of alerts from single genes to gene sets, yielding a global significance statement and the respective concentration or time where the first enrichment of the gene set can be observed. The methodology is based on a Kolmogorov-Smirnov type test statistic for each gene set.
Results: Simulations show that a majority of these sets can be identified especially for lower numbers of true gene sets with a signal. The false positive rate can be controlled by subsequent decorrelation approaches. Overall, the true gene set-wise alerts are rarely overestimated and rather tend to be underestimated.
Availability and implementation: The code needed to reproduce the simulations and apply the AlertGS methodology is available at the GitHub repository https://github.com/FKappenberg/AlertGS.
Supplementary information: Supplementary material is available online.
Citations: 0
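For readers unfamiliar with Kolmogorov-Smirnov type statistics, the sketch below computes the generic two-sample version: the largest gap between the empirical distribution functions of two samples. It is not the AlertGS test statistic, whose exact form is not given in the abstract; treating the two samples as per-gene scores inside and outside a gene set is an assumption for illustration, as are the function name and the example values.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(v) - F_b(v)|."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    i = j = 0
    # The empirical CDFs only change at observed values, so checking
    # each distinct value once is enough to find the maximum gap.
    for value in sorted(set(a) | set(b)):
        while i < len(a) and a[i] <= value:
            i += 1
        while j < len(b) and b[j] <= value:
            j += 1
        largest_gap = max(largest_gap, abs(i / len(a) - j / len(b)))
    return largest_gap

# Hypothetical per-gene scores: genes in the set vs. all remaining genes.
in_set = [0.001, 0.002, 0.01]
background = [0.2, 0.5, 0.7, 0.9]
print(ks_statistic(in_set, background))  # 1.0, the samples do not overlap
```

A value near 1 means the scores in the set are distributed very differently from the background, while a value near 0 means the two distributions are almost indistinguishable.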
Demixer: A probabilistic generative model to delineate different strains of a microbial species in a mixed infection sample.
Bioinformatics (Oxford, England) Pub Date : 2025-04-03 DOI: 10.1093/bioinformatics/btaf139
V P Brintha, Manikandan Narayanan
Motivation: Multi-drug resistant or hetero-resistant Tuberculosis (TB) hinders the successful treatment of TB. Hetero-resistant TB occurs when multiple strains of the TB-causing bacterium with varying degrees of drug susceptibility are present in an individual. Existing studies predicting the proportion and identity of strains in a mixed infection sample rely on a reference database of known strains. A main challenge then is to identify de novo strains not present in the reference database, while quantifying the proportion of known strains.
Results: We present Demixer, a probabilistic generative model that uses a combination of reference-based and reference-free techniques to delineate mixed infection strains in whole genome sequencing (WGS) data. Demixer extends a topic model widely used in text mining to represent known mutations and discover novel ones. Parallelization and other heuristics enabled Demixer to process large datasets like CRyPTIC (Comprehensive Resistance Prediction for Tuberculosis: an International Consortium). In both synthetic and experimental benchmark datasets, our proposed method precisely detected the identity (e.g., 91.67% accuracy on the experimental in vitro dataset) as well as the proportions of the mixed strains. In real-world applications, Demixer revealed novel high confidence mixed infections (101 out of 1,963 Malawi samples analyzed), and new insights into the global frequency of mixed infection (2% at the most stringent threshold in the CRyPTIC dataset) and its significant association to drug resistance. Our approach is generalizable and hence applicable to any bacterial and viral WGS data.
Availability: All code relevant to Demixer is available at https://github.com/BIRDSgroup/Demixer.
Supplementary information: Suppl Information PDF file (containing Suppl Methods/Algorithms/Tables/Figures), and other Suppl Data Files are available at this link: https://drive.google.com/drive/folders/1P_OX_MbZ6QFN9Amyl2eGMBr1ySY6yNWu?usp=drive_link. The Suppl data, code and vcf files (of in vitro, synthetic and real-world datasets) have also been archived at Zenodo (doi: 10.5281/zenodo.15074330).
Citations: 0