{"title":"HarmonizR: blocking and singular feature data adjustment improve runtime efficiency and data preservation.","authors":"Simon Schlumbohm, Julia E Neumann, Philipp Neumann","doi":"10.1186/s12859-025-06073-9","DOIUrl":"10.1186/s12859-025-06073-9","url":null,"abstract":"<p><strong>Background: </strong>Data adjustment is an essential tool for increasing statistical power during analysis, for example in the case of complex multi-experiment data from (single-cell) RNA, proteomics and other omics data. Despite its benefits, data integration introduces internal biases, so-called batch effects. Because missing values are inherent to such methods and are additionally introduced by data integration, renowned algorithms such as ComBat and limma are unable to perform batch effect adjustment. Recently, the HarmonizR framework was presented for these cases, which is a tool for missing value tolerant data adjustment.</p><p><strong>Results: </strong>In this contribution, we provide significant improvements to the HarmonizR approach. A novel blocking strategy is introduced to substantially reduce runtime, while still supporting parallel architectures. Additionally, a \"unique removal\" strategy has been integrated into HarmonizR to maintain even more features for adjustment in datasets, showing a feature rescue of up to 103.9% for our tested datasets. In this work, we show (1) substantially improved runtime for both small and large, real datasets and (2) the ability to retain more features from the integrated dataset during adjustment.</p><p><strong>Conclusion: </strong>The proposed improvements tackle the previous shortcomings of the published HarmonizR version. Since HarmonizR was mainly developed for dataset integration on rare tumor entities, it did not include runtime improvements beyond parallelization, which has been addressed in this update. 
An additional, welcome update regarding improved feature rescue further enhances the algorithm's ability to quickly and robustly perform batch effect reduction.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"47"},"PeriodicalIF":2.9,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11817103/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143398011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HGATLink: single-cell gene regulatory network inference via the fusion of heterogeneous graph attention networks and transformer.","authors":"Yao Sun, Jing Gao","doi":"10.1186/s12859-025-06071-x","DOIUrl":"10.1186/s12859-025-06071-x","url":null,"abstract":"<p><strong>Background: </strong>Gene regulatory networks (GRNs) involve complex regulatory relationships between genes and play important roles in the study of various biological systems and diseases. The introduction of single-cell sequencing (scRNA-seq) technology has allowed gene regulation studies to be carried out on specific cell types, providing the opportunity to accurately infer gene regulatory networks. However, the sparsity and noise problems of single-cell sequencing data pose challenges for gene regulatory network inference, and although many gene regulatory network inference methods have been proposed, they often fail to eliminate transitive interactions or do not address multilevel relationships and nonlinear features in the graph data well.</p><p><strong>Results: </strong>On the basis of the above limitations, we propose a gene regulatory network inference framework named HGATLink. 
HGATLink combines the heterogeneous graph attention network and simplified transformer to capture complex interactions effectively between genes in low-dimensional space via matrix decomposition techniques, which not only enhances the ability to model complex heterogeneous graph structures and alleviate transitive interactions, but also effectively captures the long-range dependencies between genes to ensure more accurate prediction.</p><p><strong>Conclusions: </strong>Compared with 10 state-of-the-art GRN inference methods on 14 scRNA-seq datasets under two metrics, AUROC and AUPRC, HGATLink shows good stability and accuracy in gene regulatory network inference tasks.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"49"},"PeriodicalIF":2.9,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11817978/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143398014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mammalian piRNA target prediction using a hierarchical attention model.","authors":"Tianjiao Zhang, Liang Chen, Haibin Zhu, Garry Wong","doi":"10.1186/s12859-025-06068-6","DOIUrl":"10.1186/s12859-025-06068-6","url":null,"abstract":"<p><strong>Background: </strong>Piwi-interacting RNAs (piRNAs) are well established for monitoring and protecting the genome from transposons in germline cells. Recently, numerous studies provided evidence that piRNAs also play important roles in regulating mRNA transcript levels. Despite their significant role in regulating cellular RNA levels, the piRNA targeting rules are not well defined, especially in mammals, which poses obstacles to the elucidation of piRNA function.</p><p><strong>Results: </strong>Given the complexity and current limitation in understanding the mammalian piRNA targeting rules, we designed a deep learning model by selecting appropriate deep learning sub-networks based on the targeting patterns of piRNA inferred from previous experiments. Additionally, to alleviate the problem of insufficient data, a transfer learning approach was employed. Our model achieves a good discriminatory power (Accuracy: 98.5%) in predicting an independent test dataset. Finally, this model was utilized to predict the targets of all mouse and human piRNAs available in the piRNA database.</p><p><strong>Conclusions: </strong>In this research, we developed a deep learning framework that significantly advances the prediction of piRNA targets, overcoming the limitations posed by insufficient data and current incomplete targeting rules. 
The piRNA target prediction network and results can be downloaded from https://github.com/SofiaTianjiaoZhang/piRNATarget .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"50"},"PeriodicalIF":2.9,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11817350/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143398020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hybrid machine learning framework for functional annotation of mitochondrial glutathione transport and metabolism proteins in cancers.","authors":"Luke Kennedy, Jagdeep K Sandhu, Mary-Ellen Harper, Miroslava Cuperlovic-Culf","doi":"10.1186/s12859-025-06051-1","DOIUrl":"10.1186/s12859-025-06051-1","url":null,"abstract":"<p><strong>Background: </strong>Alterations of metabolism, including changes in mitochondrial metabolism as well as glutathione (GSH) metabolism, are a well-appreciated hallmark of many cancers. Mitochondrial GSH (mGSH) transport is a poorly characterized aspect of GSH metabolism, which we investigate in the context of cancer. Existing functional annotation approaches using machine learning (ML) or deep learning (DL) models based only on protein sequences were unable to annotate functions in biological contexts.</p><p><strong>Results: </strong>We develop a flexible ML framework for functional annotation from diverse feature data. This hybrid ML framework leverages cancer cell line multi-omics data and other biological knowledge data as features to uncover potential genes involved in mGSH metabolism and membrane transport in cancers. This framework achieves strong performance across functional annotation tasks and several cell line and primary tumor cancer samples. For our application, classification models predict the known mGSH transporter SLC25A39, but not SLC25A40, as highly likely to be related to mGSH metabolism in cancers. SLC25A10, SLC25A50, and orphan SLC25A24, SLC25A43 are predicted to be associated with mGSH metabolism in multiple biological contexts, and structural analysis of these proteins reveals similarities in potential substrate binding regions to the binding residues of SLC25A39.</p><p><strong>Conclusion: </strong>These findings have implications for a better understanding of cancer cell metabolism and novel therapeutic targets with respect to GSH metabolism through potential novel functional annotations of genes. 
The hybrid ML framework proposed here can be applied to other biological function classifications or multi-omics datasets to generate hypotheses in various biological contexts. Code and a tutorial for generating models and predictions in this framework are available at: https://github.com/lkenn012/mGSH_cancerClassifiers .</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"48"},"PeriodicalIF":2.9,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11817629/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143398009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SEGT-GO: a graph transformer method based on PPI serialization and explanatory artificial intelligence for protein function prediction.","authors":"Yansong Wang, Yundong Sun, Baohui Lin, Haotian Zhang, Xiaoling Luo, Yumeng Liu, Xiaopeng Jin, Dongjie Zhu","doi":"10.1186/s12859-025-06059-7","DOIUrl":"10.1186/s12859-025-06059-7","url":null,"abstract":"<p><strong>Background: </strong>A massive amount of protein sequences have been obtained, but their functions remain challenging to discern. In recent research on protein function prediction, Protein-Protein Interaction (PPI) Networks have played a crucial role. Uncovering potential function relationships between distant proteins within PPI networks is essential for improving the accuracy of protein function prediction. Most current studies attempt to capture these distant relationships by stacking graph network layers, but performance gains diminish as the number of layers increases.</p><p><strong>Results: </strong>To further explore the potential functional relationships between multi-hop proteins in PPI networks, this paper proposes SEGT-GO, a Graph Transformer method based on PPI multi-hop neighborhood Serialization and Explainable artificial intelligence for large-scale multispecies protein function prediction. The multi-hop neighborhood serialization maps multi-hop information in the PPI Network into serialized feature embeddings, enabling the Graph Transformer to learn deeper functional features within the PPI Network. Based on game theory, the SHAP eXplainable Artificial Intelligence (XAI) framework optimizes model input and filters out feature noise, enhancing model performance.</p><p><strong>Conclusions: </strong>Compared to the advanced network method DeepGraphGO, SEGT-GO achieves more competitive results in standard large-scale datasets and superior results on small ones, validating its ability to extract functional information from deep proteins. 
Furthermore, SEGT-GO achieves superior results in cross-species learning and prediction of the functions of unseen proteins, further proving the method's strong generalization.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"46"},"PeriodicalIF":2.9,"publicationDate":"2025-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11808960/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143390062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Conditional similarity triplets enable covariate-informed representations of single-cell data.","authors":"Chi-Jane Chen, Haidong Yi, Natalie Stanley","doi":"10.1186/s12859-025-06069-5","DOIUrl":"10.1186/s12859-025-06069-5","url":null,"abstract":"<p><strong>Background: </strong>Single-cell technologies enable comprehensive profiling of diverse immune cell-types through the measurement of multiple genes or proteins per individual cell. In order to translate immune signatures assayed from blood or tissue into powerful diagnostics, machine learning approaches are often employed to compute immunological summaries or per-sample featurizations, which can be used as inputs to models for outcomes of interest. Current supervised learning approaches for computing per-sample representations are trained only to accurately predict a single outcome and do not take into account relevant additional clinical features or covariates that are likely to also be measured for each sample.</p><p><strong>Results: </strong>Here, we introduce a novel approach for incorporating measured covariates in optimizing model parameters to ultimately specify per-sample encodings that accurately reflect both immune signatures and additional clinical information. Our method, CytoCoSet, is a set-based encoding method for learning per-sample featurizations, which formulates a loss function with an additional triplet term that penalizes samples with similar covariates for having disparate per-sample representations.</p><p><strong>Conclusions: </strong>Overall, incorporating clinical covariates enables the learning of encodings for each individual sample that ultimately improve prediction of clinical outcome. 
This integration of disparate information enables more robust predictions of clinical phenotypes and holds significant potential for enhancing diagnostic and treatment strategies.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"45"},"PeriodicalIF":2.9,"publicationDate":"2025-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11807331/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143381666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TockyPrep: data preprocessing methods for flow cytometric fluorescent timer analysis.","authors":"Masahiro Ono","doi":"10.1186/s12859-025-06058-8","DOIUrl":"10.1186/s12859-025-06058-8","url":null,"abstract":"<p><strong>Background: </strong>Fluorescent Timer proteins, which display time-dependent changes in their emission spectra, are invaluable for analyzing the temporal dynamics of cellular events at the single-cell level. We previously developed the Timer-of-cell-kinetics-and-activity (Tocky) tools, utilizing a specific Timer protein, Fast-FT, to monitor temporal changes in cellular activities. Despite their potential, the analysis of Timer fluorescence in flow cytometry is frequently compromised by variability in instrument settings and the absence of standardized preprocessing methods. The development and implementation of effective data preprocessing methods remain to be achieved.</p><p><strong>Results: </strong>In this study, we introduce an R package that automates the preprocessing of Timer fluorescence data from flow cytometry experiments for quantitative analysis at the single-cell level. Our aim is to standardize Timer data analysis to enhance reproducibility and accuracy across different experimental setups. The package includes a trigonometric transformation method to elucidate the dynamics of Fluorescent Timer proteins. We have identified the normalization of immature and mature Timer fluorescence data as essential for robust analysis, clarifying how this normalization affects the analysis of Timer maturation. These preprocessing methods are all encapsulated within the TockyPrep package.</p><p><strong>Conclusions: </strong>TockyPrep is available for distribution via GitHub at https://github.com/MonoTockyLab/TockyPrep , providing tools for data preprocessing and basic visualization of Timer fluorescence data. 
This toolkit is expected to enhance the utility of experimental systems utilizing Fluorescent Timer proteins, including the Tocky tools.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"44"},"PeriodicalIF":2.9,"publicationDate":"2025-02-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11807314/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143373697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction: HISS: Snakemake-based workflows for performing SMRT-RenSeq assembly, AgRenSeq and dRenSeq for the discovery of novel plant disease resistance genes.","authors":"Thomas M Adams, Moray Smith, Yuhan Wang, Lynn H Brown, Micha M Bayer, Ingo Hein","doi":"10.1186/s12859-024-06014-y","DOIUrl":"10.1186/s12859-024-06014-y","url":null,"abstract":"","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"43"},"PeriodicalIF":2.9,"publicationDate":"2025-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11806890/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143370128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Direct coupling analysis and the attention mechanism.","authors":"Francesco Caredda, Andrea Pagnani","doi":"10.1186/s12859-025-06062-y","DOIUrl":"10.1186/s12859-025-06062-y","url":null,"abstract":"<p>Proteins are involved in nearly all cellular functions, encompassing roles in transport, signaling, enzymatic activity, and more. Their functionalities crucially depend on their complex three-dimensional arrangement. For this reason, being able to predict their structure from the amino acid sequence has been and still is a phenomenal computational challenge that the introduction of AlphaFold solved with unprecedented accuracy. However, the inherent complexity of AlphaFold's architectures makes it challenging to understand the rules that ultimately shape the protein's predicted structure. This study investigates a single-layer unsupervised model based on the attention mechanism. More precisely, we explore a Direct Coupling Analysis (DCA) method that mimics the attention mechanism of several popular Transformer architectures, such as AlphaFold itself. The model's parameters, notably fewer than those in standard DCA-based algorithms, can be directly used for extracting structural determinants such as the contact map of the protein family under study. Additionally, the functional form of the energy function of the model enables us to deploy a multi-family learning strategy, allowing us to effectively integrate information across multiple protein families, whereas standard DCA algorithms are typically limited to single protein families. 
Finally, we implemented a generative version of the model using an autoregressive architecture, capable of efficiently generating new proteins in silico.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"41"},"PeriodicalIF":2.9,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11804077/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instance-level semantic segmentation of nuclei based on multimodal structure encoding.","authors":"Bo Guan, Guangdi Chu, Ziying Wang, Jianmin Li, Bo Yi","doi":"10.1186/s12859-025-06066-8","DOIUrl":"10.1186/s12859-025-06066-8","url":null,"abstract":"<p><strong>Background: </strong>Accurate segmentation and classification of cell nuclei are crucial for histopathological image analysis. However, existing deep neural network-based methods often struggle to capture complex morphological features and global spatial distributions of cell nuclei due to their reliance on local receptive fields.</p><p><strong>Methods: </strong>This study proposes a graph neural structure encoding framework based on a vision-language model. The framework incorporates: (1) A multi-scale feature fusion and knowledge distillation module utilizing the Contrastive Language-Image Pre-training (CLIP) model's image encoder; (2) A method to transform morphological features of cells into textual descriptions for semantic representation; and (3) A graph neural network approach to learn spatial relationships and contextual information between cell nuclei.</p><p><strong>Results: </strong>Experimental results demonstrate that the proposed method significantly improves the accuracy of cell nucleus segmentation and classification compared to existing approaches. The framework effectively captures complex nuclear structures and global distribution features, leading to enhanced performance in histopathological image analysis.</p><p><strong>Conclusions: </strong>By deeply mining the morphological features of cell nuclei and their spatial topological relationships, our graph neural structure encoding framework achieves high-precision nuclear segmentation and classification. 
This approach shows significant potential for enhancing histopathological image analysis, potentially leading to more accurate diagnoses and improved understanding of cellular structures in pathological tissues.</p>","PeriodicalId":8958,"journal":{"name":"BMC Bioinformatics","volume":"26 1","pages":"42"},"PeriodicalIF":2.9,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11804060/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363556","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}