Bioinformatics (Oxford, England): latest articles

GeOKG: geometry-aware knowledge graph embedding for Gene Ontology and genes.
Bioinformatics (Oxford, England) Pub Date: 2025-03-29 DOI: 10.1093/bioinformatics/btaf160
Chang-Uk Jeong, Jaesik Kim, Dokyoon Kim, Kyung-Ah Sohn
Motivation: Leveraging deep learning for the representation learning of Gene Ontology (GO) and Gene Ontology Annotation (GOA) holds significant promise for enhancing downstream biological tasks such as protein-protein interaction prediction. Prior approaches have predominantly used text- and graph-based methods, embedding GO and GOA in a single geometric space (e.g. Euclidean or hyperbolic). However, since the GO graph exhibits a complex and nonmonotonic hierarchy, single-space embeddings are insufficient to fully capture its structural nuances.
Results: In this study, we address this limitation by exploiting geometric interaction to better reflect the intricate hierarchical structure of GO. Our proposed method, Geometry-Aware Knowledge Graph Embeddings for GO and Genes (GeOKG), leverages interactions among various geometric representations during training, thereby modeling the complex hierarchy of GO more effectively. Experiments at the GO level demonstrate the benefits of incorporating these geometric interactions, while gene-level tests reveal that GeOKG outperforms existing methods in protein-protein interaction prediction. These findings highlight the potential of using geometric interaction for embedding heterogeneous biomedical networks.
Availability and implementation: https://github.com/ukjung21/GeOKG.
Volume 41, issue 4. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12036960/pdf/
Citations: 0
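The GeOKG abstract contrasts embeddings in Euclidean and hyperbolic space. As a rough illustration of why the choice of geometry matters, the sketch below compares the ordinary Euclidean distance with the standard distance in the Poincaré ball model of hyperbolic space. This is not GeOKG's model, whose details are not given in the abstract; the function names and the toy 2D points are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two points in Euclidean space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))


# Hypothetical embeddings: a root-like term at the origin and two
# leaf-like terms close to the boundary of the ball.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

print("Euclidean leaf-to-leaf:", euclidean_distance(leaf_a, leaf_b))
print("Poincare  leaf-to-leaf:", poincare_distance(leaf_a, leaf_b))
print("Poincare  root-to-leaf:", poincare_distance(root, leaf_a))
```

In the Poincaré ball, distances grow quickly near the boundary, so two leaves end up almost as far apart as a path through the root. That tree-like behavior is the usual motivation for hyperbolic embeddings of hierarchies.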
PyEvoCell: an LLM-augmented single-cell trajectory analysis dashboard.
Bioinformatics (Oxford, England) Pub Date: 2025-03-29 DOI: 10.1093/bioinformatics/btaf158
Sachin Mathur, Mathieu Beauvais, Arnau Giribet, Nicolas Aragon Barrero, Chaorui-Tom Zhang, Towsif Rahman, Seqian Wang, Jeremy Huang, Nima Nouri, Andre Kurlovs, Ziv Bar-Joseph, Peyman Passban
Motivation: Several methods have been developed for trajectory inference in single-cell studies. However, identifying relevant lineages among several cell types and interpreting the results of downstream analysis remains a challenging task that requires deep understanding of various cell type transitions and progression patterns. Therefore, there is a need for methods that can aid researchers in the analysis and interpretation of such trajectories.
Results: We developed PyEvoCell, a dashboard for trajectory interpretation and analysis that is augmented by large language model (LLM) capabilities. PyEvoCell applies the LLM to the outputs of trajectory inference methods such as Monocle3 to suggest biologically relevant lineages. Once a lineage is defined, users can conduct differential expression and functional analyses, which are also interpreted by the LLM. Finally, any hypothesis or claim derived from the analysis can be validated using the veracity filter, a feature enabled by the LLM, to confirm or reject claims by providing relevant PubMed citations.
Availability and implementation: The software is available at https://github.com/Sanofi-Public/PyEvoCell. The repository contains installation instructions, a user manual, demo datasets, and license conditions. See also https://doi.org/10.5281/zenodo.15114803.
Volume 41, issue 4. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12014098/pdf/
Citations: 0
HISSTA: a human in situ single-cell transcriptome atlas.
Bioinformatics (Oxford, England) Pub Date: 2025-03-29 DOI: 10.1093/bioinformatics/btaf142
Jiwon Yu, Jiwoo Moon, Minseo Kim, Gyeol Han, Insu Jang, Jinyoung Lim, Seungmook Lee, Seok-Hwan Yoon, Woong-Yang Park, Byungwook Lee, Sanghyuk Lee
Motivation: Spatial transcriptomics holds great promise for revolutionizing biology and medicine by providing gene expression profiles with spatial information. Until recently, spatial resolution has been limited, but advances in high-throughput in situ imaging technologies now offer new opportunities by covering thousands of genes at a single-cell or even subcellular resolution, necessitating databases dedicated to comprehensive coverage and analysis with user-friendly interfaces.
Results: We introduce the HISSTA database, which facilitates the archival and analysis of in situ transcriptome data at single-cell resolution from various human tissues. We have collected and annotated spatial transcriptome data generated by MERFISH, CosMx SMI, and Xenium techniques, encompassing 112 samples and 28 million cells across 16 tissue types from 63 studies. To decipher spatial contexts, we have implemented advanced tools for cell type annotation, spatial colocalization, spatial cellular communication, and niche analyses. Notably, all datasets and annotations are interactively accessible through Vitessce, allowing users to focus on regions of interest and examine gene expression in detail. HISSTA is a unique database designed to manage the rapidly growing dataset of in situ transcriptomes at single-cell resolution. Given its comprehensive data content and advanced analysis tools with interactive visualizations, HISSTA is poised to significantly impact cancer diagnosis, precision medicine, and digital pathology.
Availability and implementation: HISSTA is freely accessible at https://kbds.re.kr/hissta/. The source code is available at https://doi.org/10.5281/zenodo.14904523.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12002909/pdf/
Citations: 0
Marker selection strategies for circulating tumor DNA guided by phylogenetic inference.
Bioinformatics (Oxford, England) Pub Date: 2025-03-29 DOI: 10.1093/bioinformatics/btaf145
Xuecong Fu, Zhicheng Luo, Yueqian Deng, William LaFramboise, David Bartlett, Russell Schwartz
Motivation: Blood-based profiling of tumor DNA ("liquid biopsy") offers great prospects for non-invasive early cancer diagnosis and clinical guidance, but requires further computational advances to become a robust quantitative assay of tumor clonal evolution. We propose new methods to better characterize tumor clonal dynamics from circulating tumor DNA (ctDNA), through application to two specific tasks: (i) applying longitudinal ctDNA data to refine phylogeny models of clonal evolution, and (ii) quantifying changes in clonal frequencies that may be indicative of treatment response or tumor progression. We pose these through a probabilistic framework for optimally identifying markers and using them to characterize clonal evolution.
Results: We first estimate a density over clonal tree models using bootstrap samples over pre-treatment tissue-based sequence data. We then refine these models over successive longitudinal samples. We use the resulting framework for modeling and refining tree densities to pose a set of optimization problems for selecting ctDNA markers to maximize measures of utility for reducing uncertainty in phylogeny models and quantifying clonal frequencies given the models. We tested our methods on synthetic data and showed them to be effective at refining tree densities and inferring clonal frequencies. Application to real tumor data further demonstrated the methods' effectiveness in refining a lineage model and assessing its clonal frequencies. The work shows the power of computational methods to improve marker selection, clonal lineage reconstruction, and clonal dynamics profiling for more precise and quantitative assays of somatic evolution and tumor progression.
Availability and implementation: https://github.com/CMUSchwartzLab/Mase-phi.git (DOI: 10.5281/zenodo.14776163).
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12002908/pdf/
Citations: 0
XPRS: a tool for interpretable and explainable polygenic risk score.
Bioinformatics (Oxford, England) Pub Date: 2025-03-29 DOI: 10.1093/bioinformatics/btaf143
Na Yeon Kim, Seunggeun Lee
Summary: The polygenic risk score (PRS) is an important method for assessing genetic susceptibility to diseases; however, its clinical utility is limited by a lack of interpretability tools. To address this problem, we introduce eXplainable PRS (XPRS), an interpretation and visualization tool that decomposes PRSs into gene/region and single nucleotide polymorphism (SNP) contribution scores via Shapley additive explanations (SHAP), which provide insights into the specific genes and SNPs that contribute significantly to the PRS of an individual. The software features a multilevel visualization approach, including Manhattan plots, LocusZoom-like plots, and tables at the population and individual levels, to highlight important genes and SNPs. With its user-friendly web interface, XPRS allows for straightforward data input and interpretation. By bridging the gap between complex genetic data and actionable clinical insights, XPRS can improve communication between clinicians and patients.
Availability and implementation: The XPRS software is publicly available on GitHub at https://github.com/nayeonkim93/XPRS, and a demo is available through our cloud-based web service at https://xprs.leelabsg.org/.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12043004/pdf/
Citations: 0
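To make the idea of decomposing a PRS into SNP and gene contributions concrete, here is a minimal sketch. It assumes a purely additive PRS (a weighted sum of allele dosages) and uses the known result that, for a linear model with features treated as independent, the SHAP value of feature j is beta_j * (x_j - mean(x_j)). Whether XPRS computes its scores exactly this way is not stated in the abstract; the SNP IDs, gene names, weights, and dosages below are all hypothetical.

```python
from collections import defaultdict


def snp_contributions(weights, dosages, mean_dosages):
    """Per-SNP contribution beta_j * (g_j - mean_j) to one individual's PRS."""
    return {
        snp: weights[snp] * (dosages[snp] - mean_dosages[snp])
        for snp in weights
    }


def gene_contributions(contributions, snp_to_gene):
    """Sum SNP-level contributions over the gene each SNP is mapped to."""
    totals = defaultdict(float)
    for snp, value in contributions.items():
        totals[snp_to_gene[snp]] += value
    return dict(totals)


# Hypothetical toy inputs (not taken from any real PRS model).
weights = {"rs1": 0.5, "rs2": -0.25, "rs3": 0.125}        # effect sizes
mean_dosages = {"rs1": 1.0, "rs2": 0.5, "rs3": 1.5}       # population means
individual = {"rs1": 2, "rs2": 0, "rs3": 1}               # allele dosages 0/1/2
snp_to_gene = {"rs1": "GENE_A", "rs2": "GENE_A", "rs3": "GENE_B"}

contribs = snp_contributions(weights, individual, mean_dosages)
by_gene = gene_contributions(contribs, snp_to_gene)

prs = sum(weights[s] * individual[s] for s in weights)
mean_prs = sum(weights[s] * mean_dosages[s] for s in weights)
print(contribs)
print(by_gene)
print(prs - mean_prs, sum(contribs.values()))
```

The contributions sum to the difference between the individual's PRS and the population mean PRS, which is the additivity property that makes this kind of decomposition easy to read at both the SNP and gene level.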
Realfreq: real-time base modification analysis for nanopore sequencing.
Bioinformatics (Oxford, England) Pub Date: 2025-03-29 DOI: 10.1093/bioinformatics/btaf151
Suneth Samarasinghe, Ira Deveson, Hasindu Gamaarachchi
Summary: Nanopore sequencers allow sequencing data to be accessed in real time. This allows live analysis to be performed while the sequencing is running, reducing the turnaround time of the results. We introduce realfreq, a framework for obtaining real-time base modification frequencies while a nanopore sequencer is in operation. Realfreq calculates and allows access to the real-time base modification frequency results while the sequencer is running. We demonstrate that the data analysis rate with realfreq on a laptop computer can keep up with the output data rate of a nanopore MinION sequencer, while a desktop computer can keep up with a single PromethION 2 solo flowcell.
Availability and implementation: Realfreq is a free and open-source application implemented in the C programming language and shell scripts. The source code and the documentation for realfreq can be found at https://github.com/imsuneth/realfreq. The version used for the manuscript is also available at https://doi.org/10.5281/zenodo.15128668.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12079415/pdf/
Citations: 0
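The core quantity in the realfreq abstract, a base modification frequency that can be queried while data is still arriving, can be sketched as a pair of running counters per genomic site. This is an illustrative Python sketch only: realfreq itself is written in C, and its data structures and input handling are not described in the abstract. The class name `ModFreqTracker` and the assumption that each call has already been classified as modified or unmodified are hypothetical.

```python
from collections import defaultdict


class ModFreqTracker:
    """Running per-site modification frequency, updated as calls arrive."""

    def __init__(self):
        self.modified = defaultdict(int)  # (contig, position) -> modified calls
        self.total = defaultdict(int)     # (contig, position) -> all calls

    def add_call(self, contig, position, is_modified):
        """Record one base call at a site; assumes the call is already classified."""
        key = (contig, position)
        self.total[key] += 1
        if is_modified:
            self.modified[key] += 1

    def frequency(self, contig, position):
        """Fraction of calls at the site that were modified, or None if unseen."""
        key = (contig, position)
        if self.total.get(key, 0) == 0:
            return None
        return self.modified[key] / self.total[key]


# Hypothetical stream of calls, as they might arrive during a run.
tracker = ModFreqTracker()
stream = [
    ("chr1", 100, True),
    ("chr1", 100, False),
    ("chr1", 100, True),
    ("chr1", 250, False),
]
for contig, position, is_modified in stream:
    tracker.add_call(contig, position, is_modified)

print(tracker.frequency("chr1", 100))
print(tracker.frequency("chr1", 250))
print(tracker.frequency("chr2", 1))
```

Because each update touches only two counters, the frequency at any site can be read at any point in the stream without reprocessing earlier calls.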
CytoSimplex: visualizing single-cell fates and transitions on a simplex.
Bioinformatics (Oxford, England) Pub Date: 2025-03-29 DOI: 10.1093/bioinformatics/btaf119
Jialin Liu, Yichen Wang, Chen Li, Yichen Gu, Noriaki Ono, Joshua Welch
Summary: Cells differentiate to their final fates along unique trajectories, often involving multi-potent progenitors that can produce multiple terminally differentiated cell types. Recent developments in single-cell transcriptomic and epigenomic measurement provide tremendous opportunities for mapping these trajectories. The visualization of single-cell data often relies on dimension reduction methods such as UMAP to simplify high-dimensional single-cell data down to an understandable 2D form. However, these dimension reduction methods are not constructed to allow direct interpretation of the reduced dimensions in terms of cell differentiation. To address these limitations, we developed a new approach that places each cell from a single-cell dataset within a simplex whose vertices correspond to terminally differentiated cell types. Our approach can quantify and visualize current cell fate commitment and future cell potential. We developed CytoSimplex, a standalone open-source package implemented in R and Python that provides simple and intuitive visualizations of cell differentiation in 2D ternary and 3D quaternary plots. We believe that CytoSimplex can help researchers gain a better understanding of cell type transitions in specific tissues and characterize developmental processes.
Availability and implementation: The R version of CytoSimplex is available on GitHub at https://github.com/welch-lab/CytoSimplex. The Python version of CytoSimplex is available on GitHub at https://github.com/welch-lab/pyCytoSimplex.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11992338/pdf/
Citations: 0
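Placing a cell inside a simplex whose vertices are terminal cell types amounts to giving the cell barycentric coordinates. The sketch below shows the geometric step for the 2D ternary case: non-negative similarity scores to three terminal types are normalized to sum to 1 and then mapped to a point in an equilateral triangle. How CytoSimplex actually computes the similarities is not covered by the abstract, so the scores here are assumed inputs, and the function names are hypothetical rather than part of the CytoSimplex API.

```python
import math

# Corners of an equilateral triangle, one per terminal cell type.
VERTICES = ((0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2))


def barycentric(similarities):
    """Normalize three non-negative similarity scores so they sum to 1."""
    if len(similarities) != 3:
        raise ValueError("a ternary plot needs exactly three scores")
    if any(s < 0 for s in similarities):
        raise ValueError("scores must be non-negative")
    total = sum(similarities)
    if total == 0:
        raise ValueError("at least one score must be positive")
    return tuple(s / total for s in similarities)


def ternary_xy(similarities):
    """Map three similarity scores to 2D coordinates inside the triangle."""
    weights = barycentric(similarities)
    x = sum(w * vx for w, (vx, _) in zip(weights, VERTICES))
    y = sum(w * vy for w, (_, vy) in zip(weights, VERTICES))
    return x, y


# Hypothetical cells: one fully committed, one uncommitted progenitor.
committed = ternary_xy((1.0, 0.0, 0.0))
progenitor = ternary_xy((2.0, 2.0, 2.0))
print(committed, progenitor)
```

A cell that resembles only one terminal type lands on that type's vertex, while a cell equally similar to all three lands at the centroid, which is what makes the position directly interpretable as fate commitment.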
OLTA: Optimizing bait seLection for TArgeted sequencing.
Bioinformatics (Oxford, England) Pub Date: 2025-03-29 DOI: 10.1093/bioinformatics/btaf146
Mete Orhun Minbay, Richard Sun, Vijay Ramachandran, Ahmet Ay, Tamer Kahveci
Motivation: Targeted enrichment via capture probes, also known as baits, is a promising complementary procedure for next-generation sequencing methods. This technique uses short biotinylated oligonucleotide probes that hybridize with complementary genetic material in a sample. Following hybridization, the target fragments can be easily isolated and processed with minimal contamination from irrelevant material. Designing an efficient set of baits for a set of target sequences, however, is an NP-hard problem.
Results: We develop a novel heuristic algorithm that leverages the similarities between the characteristics of the Minimum Bait Cover and the Closest String problems to reduce the number of baits needed to cover a given target sequence. Our results on real and synthetic datasets demonstrate that our algorithm, OLTA, produces the fewest baits for nearly all experimental settings and datasets. On average, it produces 6% and 11% fewer baits than the next best state-of-the-art methods for two major real datasets, AIV and MEGARES. Also, its bait set has the highest utilization and the minimum redundancy.
Availability and implementation: Our algorithm is available at github.com/FuelTheBurn/OLTA-Optimizing-bait-seLection-for-TArgeted-sequencing. Test data and other software are archived at doi.org/10.5281/zenodo.15086636.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12033030/pdf/
Citations: 0
RNALoc-LM: RNA subcellular localization prediction using pre-trained RNA language model.
Bioinformatics (Oxford, England) Pub Date: 2025-03-29 DOI: 10.1093/bioinformatics/btaf127
Min Zeng, Xinyu Zhang, Yiming Li, Chengqian Lu, Rui Yin, Fei Guo, Min Li
Motivation: Accurately predicting RNA subcellular localization is crucial for understanding the cellular functions and regulatory mechanisms of RNAs. Although many computational methods have been developed to predict the subcellular localization of lncRNAs, miRNAs, and circRNAs, very few of them are designed to simultaneously predict the subcellular localization of multiple types of RNAs. In addition, pre-trained RNA language models have shown remarkable performance in various bioinformatics tasks, such as structure prediction and functional annotation. Despite these advancements, there remains a significant gap in applying pre-trained RNA language models specifically for predicting RNA subcellular localization.
Results: In this study, we proposed RNALoc-LM, the first interpretable deep-learning framework that leverages a pre-trained RNA language model for predicting RNA subcellular localization. RNALoc-LM uses a pre-trained RNA language model to encode RNA sequences, then captures local patterns and long-range dependencies through TextCNN and BiLSTM modules. A multi-head attention mechanism is used to focus on important regions within the RNA sequences. The results demonstrate that RNALoc-LM significantly outperforms both deep-learning baselines and existing state-of-the-art predictors. Additionally, motif analysis highlights RNALoc-LM's potential for discovering important motifs, while an ablation study confirms the effectiveness of the RNA sequence embeddings generated by the pre-trained RNA language model.
Availability and implementation: The RNALoc-LM web server is available at http://csuligroup.com:8000/RNALoc-LM. The source code can be obtained from https://github.com/CSUBioGroup/RNALoc-LM.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11978386/pdf/
Citations: 0
Negative dataset selection impacts machine learning-based predictors for multiple bacterial species promoters.
Bioinformatics (Oxford, England) Pub Date: 2025-03-29 DOI: 10.1093/bioinformatics/btaf135
Marcelo González, Roberto E Durán, Michael Seeger, Mauricio Araya, Nicolás Jara
Motivation: Advances in bacterial promoter predictors based on machine learning have greatly improved identification metrics. However, existing models have overlooked the impact of negative datasets, which was previously identified as GC-content discrepancies between positive and negative datasets in single-species models. This study aims to investigate whether multiple-species models for promoter classification are inherently biased due to the selection criteria of negative datasets. We further explore whether the generation of synthetic random sequences (SRS) that mimic the GC-content distribution of promoters can partly reduce this bias.
Results: Multiple-species predictors exhibited GC-content bias when using coding sequences (CDS) as a negative dataset, as suggested by specificity and sensitivity metrics that varied in a species-specific manner and as examined through dimensionality reduction. We demonstrated a reduction in this bias by using the SRS dataset, with less detection of background noise in real genomic data. In both scenarios, DNABERT showed the best metrics. These findings suggest that GC-balanced datasets can enhance the generalizability of promoter predictors across Bacteria.
Availability and implementation: The source code of the experiments is freely available at https://github.com/maigonzalezh/MultispeciesPromoterClassifier.
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11993300/pdf/
Citations: 0
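One simple way to build a GC-balanced negative set of the kind this abstract describes is to generate, for every promoter, a random sequence with the same length and the same G+C count. The sketch below does exactly that. It is an assumption about how such synthetic random sequences could be produced, not the authors' procedure: the abstract only says the SRS mimic the GC-content distribution of promoters. The two example promoter strings are hypothetical.

```python
import random


def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_matched_random_sequence(template, rng):
    """Random sequence with the same length and G+C count as `template`.

    Assumes the template contains only A, C, G, and T.
    """
    n_gc = sum(1 for base in template.upper() if base in "GC")
    n_at = len(template) - n_gc
    bases = [rng.choice("GC") for _ in range(n_gc)]
    bases += [rng.choice("AT") for _ in range(n_at)]
    rng.shuffle(bases)  # destroy the GC-first ordering
    return "".join(bases)


# Hypothetical promoter-like sequences used only to exercise the functions.
promoters = [
    "TTGACAATTAATCATCGGCTCGTATAATG",
    "GGCGCCTTGACGGCTAGCTCAGTCCTAGG",
]
rng = random.Random(0)  # fixed seed so the run is reproducible
negatives = [gc_matched_random_sequence(p, rng) for p in promoters]

for promoter, negative in zip(promoters, negatives):
    print(gc_content(promoter), gc_content(negative), negative)
```

Matching GC content per sequence removes base composition as a shortcut feature, so a classifier trained on these pairs has to rely on positional signal rather than on overall GC content.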