{"title":"SurvMarker: an R package for identifying survival-associated molecular features using PCA-based weighted scores.","authors":"Dona Hasini Gammune, Tongjun Gu","doi":"10.1186/s12859-026-06461-9","DOIUrl":"https://doi.org/10.1186/s12859-026-06461-9","url":null,"abstract":"<p><strong>Background: </strong>Identification of prognostic molecular features from high-dimensional molecular data is central to biomarker discovery in cancer and other complex diseases. Principal component analysis (PCA) is widely used for dimensionality reduction in survival studies, yet selecting individual features from principal components (PCs) remains challenging and often relies on arbitrary thresholds. To address this limitation, we developed SurvMarker, an R package that prioritizes survival-associated molecular features using a PCA-based scoring framework.</p><p><strong>Results: </strong>SurvMarker applies PCA to normalized molecular data, jointly evaluates PCs using multivariable Cox proportional hazards models, and ranks features by aggregating absolute loadings across survival-associated PCs. Feature significance is assessed using an empirical null framework with false discovery rate control. In both synthetic global-null and permutation-based null simulations, SurvMarker showed comparable or better false positive control, particularly in small-n, large-p settings, compared with LASSO Cox, Elastic Net Cox, and Partial Least Squares Cox, while maintaining well-calibrated null p-value distributions. In the TCGA-LAML cohort, SurvMarker achieved the best predictive performance among these methods for gene expression data, with a C-index of 0.78 and an overall time-dependent AUC of 0.882, with similar applicability to miRNA expression data. 
Compared with sparse PCA-based and fixed per-PC threshold approaches, SurvMarker also achieved better predictive performance and yielded more compact, stable feature sets across different PC settings.</p><p><strong>Conclusions: </strong>SurvMarker provides a robust, interpretable, and reproducible framework for identifying survival-associated molecular features from high-dimensional data. By combining survival-guided PC selection, weighted feature aggregation across PCs, and empirical null-based inference, it improves false discovery control, stability, and biological relevance, and offers a practical tool for biomarker discovery across multiple omics data types.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":""},"PeriodicalIF":3.3,"publicationDate":"2026-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147855732","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GEO uploader: simplifying the data deposition in the GEO repository.","authors":"Ronald Domi, Falko Noé, Peter Leary, Hubert Rehrauer","doi":"10.1186/s12859-026-06466-4","DOIUrl":"https://doi.org/10.1186/s12859-026-06466-4","url":null,"abstract":"<p><strong>Background: </strong>The Gene Expression Omnibus (GEO) (Clough and Barrett in: methods in molecular biology, Clifton, 2016) repository requires complex multistep submissions involving metadata preparation, FTP uploads, and MD5 validation. Current manual processes are error-prone, time-consuming, and require significant bioinformatics expertise, creating barriers for many researchers.</p><p><strong>Results: </strong>We present GEO Uploader, a web-based tool that automates the entire GEO submission workflow through an intuitive interface. The application reduces the submission initiation time from 2-3 h to under 20 s by automating file uploads, MD5 calculations, and metadata template population. Key features include parallel processing of uploads and checksum calculations, automated error prevention through template-based metadata completion, real-time progress tracking, and support for complex submission structures. Deployment across 30+ users with 50+ upload sessions, including datasets exceeding hundreds of gigabytes, demonstrates practical utility and reliability in research environments.</p><p><strong>Conclusion: </strong>GEO Uploader significantly reduces the technical barrier for GEO submissions while minimizing errors through comprehensive automation. The tool supports data sharing by enabling researchers without specialized bioinformatics expertise to complete submissions independently. 
Available as open-source software with multiuser deployment capabilities, GEO Uploader represents a substantial improvement in research data sharing accessibility and supports broader adoption of open science practices in the genomics community.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":""},"PeriodicalIF":3.3,"publicationDate":"2026-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147833107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ShadowVIMP: permutation-based multiple testing-controlled variable selection.","authors":"Tim Müller, Roman Hornung, Silke Szymczak, Hannes Buchner","doi":"10.1186/s12859-026-06412-4","DOIUrl":"https://doi.org/10.1186/s12859-026-06412-4","url":null,"abstract":"<p><strong>Background: </strong>Identifying relevant biomarkers is critical in clinical research and precision medicine, particularly when analysing high-dimensional data. Random forests (RFs) are promising for such settings due to their flexibility, ease of use, and their ability to handle data sets with more variables than samples. RFs assess the importance of each variable in predicting the outcome using variable importance (VIMP) scores. However, since the distribution of VIMP scores is intricate, standard statistical testing and multiple testing adjustments for variable selection are challenging.</p><p><strong>Methods: </strong>We propose shadowVIMP, a novel method for multiple testing-controlled variable selection, based on an approach similar to permutation testing. It generates permuted counterparts for each variable and compares their VIMPs with those of the original variables over multiple iterations to calculate p-values. Unlike conventional permutation testing, shadowVIMP preserves the correlation structure between variables, mitigating biases caused by the over-selection of correlated variables in RFs. We evaluated shadowVIMP against three competing RF variable selection approaches using simulation designs previously employed in studies considering VIMPs and variable selection for RFs. These designs included high- and low-dimensional data, as well as correlated and categorical variables. 
For illustration, we also applied the method to a real-world example on Alzheimer's disease.</p><p><strong>Conclusions: </strong>Our results showed that, compared to competing approaches, shadowVIMP offers advantages in high-dimensional settings, improving sensitivity while enabling multiple testing-adjusted results. Additionally, it demonstrated robustness against VIMP biases induced by correlated and categorical variables when using permutation-based VIMP. The method can be used to annotate standard VIMP plots, visually presenting selected variable sets based on different types of multiple testing adjustments and significance levels. Overall, shadowVIMP is a promising approach for providing multiple testing-adjusted variable selection while explicitly addressing known biases of RF's permutation-based VIMP measure. The shadowVIMP method is implemented in an R package shadowVIMP, which is available on CRAN.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"27 1","pages":""},"PeriodicalIF":3.3,"publicationDate":"2026-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC13147804/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147833133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multi-target drug design method based on target feature fusion.","authors":"Haoran Liu, Xiaoli Lin, Jing Hu, Xiaolong Zhang","doi":"10.1186/s12859-026-06449-5","DOIUrl":"https://doi.org/10.1186/s12859-026-06449-5","url":null,"abstract":"<p><strong>Background: </strong>Targeted drugs are medications designed to treat diseases by targeting specific sites on cancerous or diseased cells. Multi-target drugs can target multiple protein sites to treat diseases, improving therapeutic efficiency, but are more challenging to design. Computer-aided targeted drug design can reduce costs and shorten development time, with most drugs being single-target. Recent research on multi-target drug design has focused on optimizing single-target drugs into multi-target drugs, but this approach has limitations. This study proposes a multi-target drug design method based on protein feature fusion, which encodes and integrates features based on the target's sequence characteristics, enabling the design of multi-target drugs without prior knowledge of the targeted drug. The target protein sequences are embedded to extract features. Each target's features are independently encoded into latent vectors, while the features of multiple targets are encoded into similarity latent vectors. By leveraging both individual target features and the similarity features among targets, multi-target drugs can be efficiently designed.</p><p><strong>Results: </strong>We validated the proposed multi-target drug design method on three groups of targets: the 3CLpro and PLpro targets for COVID-19, the TAAR1 and DRD2 targets for schizophrenia, and the MEK1 and mTOR targets for tumors. The designed multi-target drugs can be docked with target proteins possessing unique molecular structures, tailored to the specific requirements of different target pocket structures. 
The excellent fit between the molecular structures of the multi-target drugs and the protein structures of multiple targets validates the performance of the proposed method.</p><p><strong>Conclusions: </strong>The proposed method can efficiently design multi-target drugs with stronger predicted binding affinities than those reported in previous studies. These drugs are capable of adapting to multiple targets based on the features of the target proteins. Additionally, the model demonstrates excellent generalization ability for untrained multiple targets.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":""},"PeriodicalIF":3.3,"publicationDate":"2026-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147760681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A probabilistic approach for predicting indole-3-acetic acid synthesis in bacteria using genomic data.","authors":"Zheng-Xiang Ye, Steven H Wu","doi":"10.1186/s12859-026-06445-9","DOIUrl":"https://doi.org/10.1186/s12859-026-06445-9","url":null,"abstract":"","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":" ","pages":""},"PeriodicalIF":3.3,"publicationDate":"2026-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147760694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}