{"title":"Optimizing sample size for supervised machine learning with bulk transcriptomic sequencing: a learning curve approach.","authors":"Yunhui Qi, Xinyi Wang, Li-Xuan Qin","doi":"10.1093/bib/bbaf097","DOIUrl":"10.1093/bib/bbaf097","url":null,"abstract":"<p><p>Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate classification accuracy without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the accuracy-versus-sample size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. 
Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11899567/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143613400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking ensemble machine learning algorithms for multi-class, multi-omics data integration in clinical outcome prediction.","authors":"Annette Spooner, Mohammad Karimi Moridani, Barbra Toplis, Jason Behary, Azadeh Safarchi, Salim Maher, Fatemeh Vafaee, Amany Zekry, Arcot Sowmya","doi":"10.1093/bib/bbaf116","DOIUrl":"10.1093/bib/bbaf116","url":null,"abstract":"<p><p>The complementary information found in different modalities of patient data can aid in more accurate modelling of a patient's disease state and a better understanding of the underlying biological processes of a disease. However, the analysis of multi-modal, multi-omics data presents many challenges. In this work, we compare the performance of a variety of ensemble machine learning (ML) algorithms that are capable of late integration of multi-class data from different modalities. The ensemble methods and their variations tested were (i) a voting ensemble, with hard and soft vote, (ii) a meta learner, and (iii) a multi-modal AdaBoost model using hard vote, soft vote, and meta learner to integrate the modalities on each boosting round, the PB-MVBoost model, and a novel application of a mixture of experts model. These were compared to simple concatenation. We examine these methods using data from an in-house study on hepatocellular carcinoma, plus validation datasets from studies on breast cancer and irritable bowel disease. We develop models that achieve an area under the receiver operating curve of up to 0.85 and find that two boosted methods, PB-MVBoost and AdaBoost with soft vote, were the best performing models. We also examine the stability of the features selected and the size of the clinical signature. Our work shows that integrating complementary omics and data modalities with effective ensemble ML models enhances accuracy in multi-class clinical outcome predictions and produces more stable predictive features than individual modalities or simple concatenation. 
We provide recommendations for the integration of multi-modal multi-class data.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11926982/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143673289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BAMBI integrates biostatistical and artificial intelligence methods to improve RNA biomarker discovery.","authors":"Peng Zhou, Zixiu Li, Feifan Liu, Euijin Kwon, Tien-Chan Hsieh, Shangyuan Ye, Shobha Vasudevan, Jung Ae Lee, Khanh-Van Tran, Chan Zhou","doi":"10.1093/bib/bbaf073","DOIUrl":"10.1093/bib/bbaf073","url":null,"abstract":"<p><p>RNA biomarkers enable early and precise disease diagnosis, monitoring, and prognosis, facilitating personalized medicine and targeted therapeutic strategies. However, identification of RNA biomarkers is hindered by the challenge of analyzing relatively small yet high-dimensional transcriptomics datasets, typically comprising fewer than 1000 biospecimens but encompassing hundreds of thousands of RNAs, especially noncoding RNAs. This complexity leads to several limitations in existing methods, such as poor reproducibility on independent datasets, inability to directly process omics data, and difficulty in identifying noncoding RNAs as biomarkers. Additionally, these methods often yield results that lack biological interpretation and clinical utility. To overcome these challenges, we present BAMBI (Biostatistical and Artificial-intelligence Methods for Biomarker Identification), a computational tool integrating biostatistical approaches and machine-learning algorithms. By initially reducing high dimensionality through biologically informed statistical methods followed by machine learning-based feature selection, BAMBI significantly enhances the accuracy and clinical utility of identified RNA biomarkers and also includes noncoding RNA biomarkers that existing methods may overlook. BAMBI outperformed existing methods on both real and simulated datasets by identifying individual and panel biomarkers with fewer RNAs while still ensuring superior prediction accuracy. BAMBI was benchmarked on multiple transcriptomics datasets across diseases, including breast cancer, psoriasis, and leukemia. 
The prognostic biomarkers for acute myeloid leukemia discovered by BAMBI showed significant correlations with patient survival rates in an independent cohort, highlighting its potential for enhancing clinical outcomes. The software is available on GitHub (https://github.com/CZhouLab/BAMBI).</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11929966/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143691266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Relational similarity-based graph contrastive learning for DTI prediction.","authors":"Jilong Bian, Hao Lu, Limin Wei, Yang Li, Guohua Wang","doi":"10.1093/bib/bbaf122","DOIUrl":"10.1093/bib/bbaf122","url":null,"abstract":"<p><p>As part of the drug repurposing process, it is imperative to predict the interactions between drugs and target proteins in an accurate and efficient manner. With the introduction of contrastive learning into drug-target prediction, the accuracy of drug repurposing will be further improved. However, many drug-target interaction (DTI) prediction methods based on deep learning either focus only on the structural features of proteins and drugs extracted using GNN or CNN, or focus only on their relational features extracted using heterogeneous graph neural networks on a DTI heterogeneous graph. Since the structural and relational features of proteins and drugs describe their attribute information from different perspectives, their combination can improve DTI prediction performance. We propose a relational similarity-based graph contrastive learning method for DTI prediction (RSGCL-DTI), which combines the structural and relational features of drugs and proteins to enhance the accuracy of DTI predictions. In our proposed method, the inter-protein relational features and inter-drug relational features are each extracted from the heterogeneous drug-protein interaction network through graph contrastive learning. The results demonstrate that combining the relational features obtained by graph contrastive learning with the structural ones extracted by D-MPNN and CNN enhances feature representation ability, thereby improving DTI prediction performance. 
Our proposed RSGCL-DTI outperforms eight SOTA baseline models on the four benchmark datasets, performs well on the imbalanced dataset, and also shows excellent generalization ability on unseen drug-protein pairs.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11932091/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143699627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DRAG: design RNAs as hierarchical graphs with reinforcement learning.","authors":"Yichong Li, Xiaoyong Pan, Hongbin Shen, Yang Yang","doi":"10.1093/bib/bbaf106","DOIUrl":"10.1093/bib/bbaf106","url":null,"abstract":"<p><p>The rapid development of RNA vaccines and therapeutics puts forward intensive requirements on the sequence design of RNAs. RNA sequence design, or RNA inverse folding, aims to generate RNA sequences that can fold into specific target structures. To date, efficient and high-accuracy prediction models for secondary structures of RNAs have been developed. They provide a basis for computational RNA sequence design methods. Especially, reinforcement learning (RL) has emerged as a promising approach for RNA design due to its ability to learn from trial and error in generation tasks and work without ground truth data. However, existing RL methods are limited in considering complex hierarchical structures in RNA design environments. To address the above limitation, we propose DRAG, an RL method that builds design environments for target secondary structures with hierarchical division based on graph neural networks. Through extensive experiments on benchmark datasets, DRAG exhibits remarkable performance compared with current machine-learning approaches for RNA sequence design. 
This advantage is particularly evident in long and intricate tasks involving structures with significant depth.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11904406/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143623663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MAEST: accurately spatial domain detection in spatial transcriptomics with graph masked autoencoder.","authors":"Pengfei Zhu, Han Shu, Yongtian Wang, Xiaofeng Wang, Yuan Zhao, Jialu Hu, Jiajie Peng, Xuequn Shang, Zhen Tian, Jing Chen, Tao Wang","doi":"10.1093/bib/bbaf086","DOIUrl":"10.1093/bib/bbaf086","url":null,"abstract":"<p><p>Spatial transcriptomics (ST) technology provides gene expression profiles with spatial context, offering critical insights into cellular interactions and tissue architecture. A core task in ST is spatial domain identification, which involves detecting coherent regions with similar spatial expression patterns. However, existing methods often fail to fully exploit spatial information, leading to limited representational capacity and suboptimal clustering accuracy. Here, we introduce MAEST, a novel graph neural network model designed to address these limitations in ST data. MAEST leverages graph masked autoencoders to denoise and refine representations while incorporating graph contrastive learning to prevent feature collapse and enhance model robustness. By integrating one-hop and multi-hop representations, MAEST effectively captures both local and global spatial relationships, improving clustering precision. Extensive experiments across diverse datasets, including the human brain, mouse hippocampus, olfactory bulb, brain, and embryo, demonstrate that MAEST outperforms seven state-of-the-art methods in spatial domain identification. Furthermore, MAEST showcases its ability to integrate multi-slice data, identifying joint domains across horizontal tissue sections with high accuracy. These results highlight MAEST's versatility and effectiveness in unraveling the spatial organization of complex tissues. 
The source code of MAEST can be obtained at https://github.com/clearlove2333/MAEST.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11886571/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143572161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AttentionGRN: a functional and directed graph transformer for gene regulatory network reconstruction from scRNA-seq data.","authors":"Zhen Gao, Yansen Su, Jin Tang, Huaiwan Jin, Yun Ding, Rui-Fen Cao, Pi-Jing Wei, Chun-Hou Zheng","doi":"10.1093/bib/bbaf118","DOIUrl":"10.1093/bib/bbaf118","url":null,"abstract":"<p><p>Single-cell RNA sequencing (scRNA-seq) enables the reconstruction of cell type-specific gene regulatory networks (GRNs), offering detailed insights into gene regulation at high resolution. While graph neural networks have become widely used for GRN inference, their message-passing mechanisms are often limited by issues such as over-smoothing and over-squashing, which hinder the preservation of essential network structure. To address these challenges, we propose a novel graph transformer-based model, AttentionGRN, which leverages soft encoding to enhance model expressiveness and improve the accuracy of GRN inference from scRNA-seq data. Furthermore, the GRN-oriented message aggregation strategies are designed to capture both the directed network structure information and functional information inherent in GRNs. Specifically, we design directed structure encoding to facilitate the learning of directed network topologies and employ functional gene sampling to capture key functional modules and global network structure. Our extensive experiments, conducted on 88 datasets across two distinct tasks, demonstrate that AttentionGRN consistently outperforms existing methods. 
Furthermore, AttentionGRN has been successfully applied to reconstruct cell type-specific GRNs for human mature hepatocytes, revealing novel hub genes and previously unidentified transcription factor-target gene regulatory associations.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11926986/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143673286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep learning-driven survival prediction in pan-cancer studies by integrating multimodal histology-genomic data.","authors":"Yongfei Hu, Xinyu Li, Ying Yi, Yan Huang, Guangyu Wang, Dong Wang","doi":"10.1093/bib/bbaf121","DOIUrl":"10.1093/bib/bbaf121","url":null,"abstract":"<p><p>Accurate cancer prognosis is essential for personalized clinical management, guiding treatment strategies and predicting patient survival. Conventional methods, which depend on the subjective evaluation of histopathological features, exhibit significant inter-observer variability and limited predictive power. To overcome these limitations, we developed the cross-attention transformer-based multimodal fusion network (CATfusion), a deep learning framework that integrates multimodal histology-genomic data for comprehensive cancer survival prediction. CATfusion employs a self-supervised learning strategy with TabAE for feature extraction and uses cross-attention mechanisms to fuse diverse data types, including mRNA-seq, miRNA-seq, copy number variation, DNA methylation variation, mutation data, and histopathological images. By successfully integrating this multi-tiered patient information, CATfusion has become an advanced survival prediction model to utilize the most diverse data types across various cancer types. CATfusion's architecture, which includes a bidirectional multimodal attention mechanism and a self-attention block, is adept at synchronizing the learning and integration of representations from various modalities. CATfusion achieves superior predictive performance over traditional and unimodal models, as demonstrated by enhanced C-index and survival area under the curve scores. The model's high accuracy in stratifying patients into distinct risk groups is a boon for personalized medicine, enabling tailored treatment plans. 
Moreover, CATfusion's interpretability, enabled by attention-based visualization, offers insights into the biological underpinnings of cancer prognosis, underscoring its potential as a transformative tool in oncology.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11926983/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143673293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PCLSurv: a prototypical contrastive learning-based multi-omics data integration model for cancer survival prediction.","authors":"Zhimin Li, Wenlan Chen, Hai Zhong, Cheng Liang","doi":"10.1093/bib/bbaf124","DOIUrl":"10.1093/bib/bbaf124","url":null,"abstract":"<p><p>Accurate cancer survival prediction remains a critical challenge in clinical oncology, largely due to the complex and multi-omics nature of cancer data. Existing methods often struggle to capture the comprehensive range of informative features required for precise predictions. Here, we introduce PCLSurv, an innovative deep learning framework designed for cancer survival prediction using multi-omics data. PCLSurv integrates autoencoders to extract omics-specific features and employs sample-level contrastive learning to identify distinct yet complementary characteristics across data views. Then, features are fused via a bilinear fusion module to construct a unified representation. To further enhance the model's capacity to capture high-level semantic relationships, PCLSurv aligns similar samples with shared prototypes while separating unrelated ones via prototypical contrastive learning. As a result, PCLSurv effectively distinguishes patient groups with varying survival outcomes at different semantic similarity levels, providing a robust framework for stratifying patients based on clinical and molecular features. We conduct extensive experiments on 11 cancer datasets. The comparison results confirm the superior performance of PCLSurv over existing alternatives. 
The source code of PCLSurv is freely available at https://github.com/LiangSDNULab/PCLSurv.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11932092/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143699503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Regularly updated benchmark sets for statistically correct evaluations of AlphaFold applications.","authors":"Laszlo Dobson, Gábor E Tusnády, Peter Tompa","doi":"10.1093/bib/bbaf104","DOIUrl":"10.1093/bib/bbaf104","url":null,"abstract":"<p><p>AlphaFold2 changed structural biology by providing high-quality structure predictions for all possible proteins. Since its inception, a plethora of applications were built on AlphaFold2, expediting discoveries in virtually all areas related to protein science. In many cases, however, optimism seems to have made scientists forget about data leakage, a serious issue that needs to be addressed when evaluating machine learning methods. Here we provide a rigorous benchmark set that can be used in a broad range of applications built around AlphaFold2/3.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11894802/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143603126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}