gSELECT: A novel pre-analysis machine-learning library enabling early hypothesis testing and predictive gene selection in single-cell data
Deniz Caliskan, Aylin Caliskan, Thomas Dandekar, Tim Breitenbach
Computational and Structural Biotechnology Journal, volume 27, pages 3510-3527. Published 2025-08-05. DOI: https://doi.org/10.1016/j.csbj.2025.07.047
Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12354962/pdf/
Abstract
Identifying biologically meaningful gene sets and evaluating how well they separate conditions based on gene expression are important steps in many transcriptomic analyses. While most workflows support data-driven feature selection, few allow direct evaluation of predefined gene sets in a classification context. This limits the ability to assess literature-derived panels or biologically motivated hypotheses prior to downstream analysis. To address this gap, we developed gSELECT, a Python library for evaluating the classification performance of both automatically ranked and user-defined gene sets. It operates on .csv or .h5ad expression matrices with group labels and can be easily integrated into existing analysis pipelines. Gene selection can be based on mutual information ranking, random sampling, or custom input. This supports hypothesis-driven testing without data-derived selection bias and allows direct evaluation of known or candidate markers. Classification is performed using multilayer perceptrons with Monte Carlo cross-validation, either on the full dataset or with a user-defined train/test split. Exhaustive and greedy search strategies are available to explore combinatorial effects among genes and to identify minimal gene combinations with high predictive power. gSELECT is intended as a pre-analysis tool to evaluate dataset separability and to support early assessment of candidate genes before committing to resource-intensive downstream analyses.
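The two core ideas in the abstract, mutual information ranking and Monte Carlo cross-validation of a chosen gene set, can be illustrated with a minimal standard-library sketch. This is not the gSELECT API. All function names, the toy data, and the gene names are hypothetical. Expression is binarized (detected vs. not detected) before computing mutual information, and a nearest-centroid classifier stands in for the multilayer perceptron to keep the example short.

```python
import math
import random
from collections import Counter


def mutual_information(values, labels):
    """Mutual information (in nats) between a discrete feature and class labels."""
    n = len(values)
    joint = Counter(zip(values, labels))
    p_x, p_y = Counter(values), Counter(labels)
    # Sum over observed (x, y) pairs of p(x,y) * log(p(x,y) / (p(x) * p(y))).
    return sum((c / n) * math.log(c * n / (p_x[x] * p_y[y]))
               for (x, y), c in joint.items())


def rank_genes(matrix, labels, genes):
    """Rank genes by MI between binarized expression (value > 0) and labels."""
    scores = {g: mutual_information([row[j] > 0 for row in matrix], labels)
              for j, g in enumerate(genes)}
    return sorted(genes, key=lambda g: scores[g], reverse=True)


def centroid_accuracy(train, test):
    """Fit per-class mean vectors on train and return accuracy on test.

    Stand-in classifier (assumption): the paper uses multilayer perceptrons.
    """
    sums, counts = {}, Counter()
    for x, y in train:
        counts[y] += 1
        acc = sums.setdefault(y, [0.0] * len(x))
        for k, v in enumerate(x):
            acc[k] += v
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(x):
        return min(centroids,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

    return sum(predict(x) == y for x, y in test) / len(test)


def monte_carlo_cv(matrix, labels, genes, subset,
                   n_splits=20, test_fraction=0.3, seed=0):
    """Mean test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    cols = [genes.index(g) for g in subset]
    samples = [([row[j] for j in cols], y) for row, y in zip(matrix, labels)]
    n_test = max(1, round(len(samples) * test_fraction))
    accuracies = []
    for _ in range(n_splits):
        shuffled = samples[:]
        rng.shuffle(shuffled)  # a fresh random split on every repetition
        accuracies.append(centroid_accuracy(shuffled[n_test:], shuffled[:n_test]))
    return sum(accuracies) / len(accuracies)


# Hypothetical toy data: one informative gene and two uninformative ones.
data_rng = random.Random(42)
genes = ["marker", "noise_a", "noise_b"]
labels = ["ctrl"] * 30 + ["treated"] * 30
matrix = []
for y in labels:
    marker = max(data_rng.gauss(5.0, 0.5), 0.0) if y == "treated" else 0.0
    matrix.append([marker, data_rng.choice([0.0, 1.0]), data_rng.choice([0.0, 1.0])])

ranked = rank_genes(matrix, labels, genes)
marker_acc = monte_carlo_cv(matrix, labels, genes, ranked[:1])
noise_acc = monte_carlo_cv(matrix, labels, genes, ["noise_a"])
print("ranking:", ranked)
print("top-ranked gene accuracy:", marker_acc)
print("noise gene accuracy:", noise_acc)
```

Passing a user-defined list as `subset` instead of the top-ranked genes corresponds to the hypothesis-driven case: the gene set is fixed before looking at the data, so the resulting accuracy is not inflated by data-derived selection.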
About the journal:
Computational and Structural Biotechnology Journal (CSBJ) is an online gold open access journal publishing research articles and reviews after full peer review. All articles are published, without barriers to access, immediately upon acceptance. The journal places a strong emphasis on functional and mechanistic understanding of how molecular components in a biological process work together through the application of computational methods. Structural data may provide such insights, but they are not a pre-requisite for publication in the journal. Specific areas of interest include, but are not limited to:
Structure and function of proteins, nucleic acids and other macromolecules
Structure and function of multi-component complexes
Protein folding, processing and degradation
Enzymology
Computational and structural studies of plant systems
Microbial Informatics
Genomics
Proteomics
Metabolomics
Algorithms and Hypothesis in Bioinformatics
Mathematical and Theoretical Biology
Computational Chemistry and Drug Discovery
Microscopy and Molecular Imaging
Nanotechnology
Systems and Synthetic Biology