{"title":"Testing and overcoming the limitations of modular response analysis.","authors":"Jean-Pierre Borg, Jacques Colinge, Patrice Ravel","doi":"10.1093/bib/bbaf098","DOIUrl":"10.1093/bib/bbaf098","url":null,"abstract":"<p><p>Modular response analysis (MRA) is an effective method to infer biological networks from perturbation data. However, it has several limitations such as strong sensitivity to noise, need of performing independent perturbations that hit a single node at a time, and linear approximation of dependencies within the network. Previously, we addressed the sensitivity of MRA to noise by reinterpreting MRA as a multilinear regression problem. We demonstrated the advantages of this approach over the conventional MRA and other known inference methods, particularly in handling noise measurements and nonlinear networks. Here, we provide new contributions to complement this theory. First, we overcome the need of perturbations to be independent, thereby augmenting MRA applicability. Second, using analysis of variance and lack-of-fit tests, we can now assess MRA compatibility with the data and identify the primary source of errors. In cases where nonlinearity prevails, we propose extending the model to a second-order polynomial. Third, we demonstrate how to effectively use prior knowledge about a network. We validated these results using 4 networks with known dynamics (3, 4, and 6 nodes) and 40 simulated networks, ranging from 10 to 200 nodes. Finally, we incorporated these innovations into our R software package MRARegress to offer a comprehensive, extended theory for MRA and to facilitate its use by the community. 
Mathematical aspects, tests details, and scripts are provided as Supplementary Information (see 'Data Availability Statement').</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11891662/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143584585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cancer gene identification through integrating causal prompting large language model with omics data-driven causal inference.","authors":"Haolong Zeng, Chaoyi Yin, Chunyang Chai, Yuezhu Wang, Qi Dai, Huiyan Sun","doi":"10.1093/bib/bbaf113","DOIUrl":"10.1093/bib/bbaf113","url":null,"abstract":"<p><p>Identifying genes causally linked to cancer from a multi-omics perspective is essential for understanding the mechanisms of cancer and improving therapeutic strategies. Traditional statistical and machine-learning methods that rely on generalized correlation approaches to identify cancer genes often produce redundant, biased predictions with limited interpretability, largely due to overlooking confounding factors, selection biases, and the nonlinear activation function in neural networks. In this study, we introduce a novel framework for identifying cancer genes across multiple omics domains, named ICGI (Integrative Causal Gene Identification), which leverages a large language model (LLM) prompted with causality contextual cues and prompts, in conjunction with data-driven causal feature selection. This approach demonstrates the effectiveness and potential of LLMs in uncovering cancer genes and comprehending disease mechanisms, particularly at the genomic level. However, our findings also highlight that current LLMs may not capture comprehensive information across all omics levels. By applying the proposed causal feature selection module to transcriptomic datasets from six cancer types in The Cancer Genome Atlas and comparing its performance with state-of-the-art methods, it demonstrates superior capability in identifying cancer genes that distinguish between cancerous and normal samples. Additionally, we have developed an online service platform that allows users to input a gene of interest and a specific cancer type. 
The platform provides automated results indicating whether the gene plays a significant role in cancer, along with clear and accessible explanations. Moreover, the platform summarizes the inference outcomes obtained from data-driven causal learning methods.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11899576/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143613380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Seq2Topt: a sequence-based deep learning predictor of enzyme optimal temperature.","authors":"Sizhe Qiu, Bozhen Hu, Jing Zhao, Weiren Xu, Aidong Yang","doi":"10.1093/bib/bbaf114","DOIUrl":"10.1093/bib/bbaf114","url":null,"abstract":"<p><p>An accurate deep learning predictor is needed for enzyme optimal temperature (${T}_{opt}$), which quantitatively describes how temperature affects the enzyme catalytic activity. In comparison with existing models, a new model developed in this study, Seq2Topt, reached a superior accuracy on ${T}_{opt}$ prediction just using protein sequences (RMSE = 12.26°C and R2 = 0.57), and could capture key protein regions for enzyme ${T}_{opt}$ with multi-head attention on residues. Through case studies on thermophilic enzyme selection and predicting enzyme ${T}_{opt}$ shifts caused by point mutations, Seq2Topt was demonstrated as a promising computational tool for enzyme mining and in-silico enzyme design. Additionally, accurate deep learning predictors of enzyme optimal pH (Seq2pHopt, RMSE = 0.88 and R2 = 0.42) and melting temperature (Seq2Tm, RMSE = 7.57 °C and R2 = 0.64) were developed based on the model architecture of Seq2Topt, suggesting that the development of Seq2Topt could potentially give rise to a useful prediction platform of enzymes.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11904407/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143623604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating long-read assemblers to assemble several aphididae genomes.","authors":"Nicolaas F V Burger, Vittorio F Nicolis, Anna-Maria Botha","doi":"10.1093/bib/bbaf105","DOIUrl":"10.1093/bib/bbaf105","url":null,"abstract":"<p><p>Aphids are a speciose family of the Hemiptera comprising >5500 species. They have adapted to feed off multiple plant species and occur on every continent on Earth. Although economically devastating, very few aphid genomes have been sequenced and assembled, and those that have suffer low contiguity due to repeat-rich and AT-rich genomes. With third-generation sequencing becoming more affordable and approaching the quality levels of second-generation sequencing, the ability to produce more contiguous aphid genome assemblies is becoming a reality. With a growing list of long-read assemblers becoming available, the choice of which assembly tool to use becomes more complicated. In this study, six recently released long-read assemblers (Canu, Flye, Hifiasm, Mecat2, Raven, and Wtdbg2) were evaluated on several quality and contiguity metrics after assembling four populations (or biotypes) of the same species (Russian wheat aphid, Diuraphis noxia) and two unrelated aphid species that have publicly available long-read sequences. Not all assemblers fared equally well across the different read sets, but, overall, the Hifiasm and Canu assemblers performed the best. Merging of the best assemblies for each read set was also performed using quickmerge, where, in some cases, it resulted in superior assemblies and, in others, introduced more errors. Ab initio gene calling between assemblies of the same read set also showed less similarity than expected. 
Overall, the quality control pipeline followed during the assembly resulted in chromosome-level assemblies with minimal structural or quality artefacts.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11904405/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143623664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pragmatic soft-decision data readout of encoded large DNA.","authors":"Qi Ge, Rui Qin, Shuang Liu, Quan Guo, Changcai Han, Weigang Chen","doi":"10.1093/bib/bbaf102","DOIUrl":"10.1093/bib/bbaf102","url":null,"abstract":"<p><p>The encoded large DNA can be cloned and stored in vivo, capable of write-once and stable replication for multiple retrievals, offering potential in economic data archiving. Nanopore sequencing is advantageous in data access of large DNA due to its rapidity and long-read sequencing capability. However, the data readout is commonly limited by insertion and deletion (indel) errors and sequence assembly complexity. Here, a pragmatic soft-decision data readout is presented, achieving assembly-free sequence reconstruction, indel error correction, and ultra-low coverage data readout. Specifically, the watermark is cleverly embedded within large DNA fragments, allowing for the direct localization of raw reads via watermark alignment to avoid complex read assembly. A soft-decision forward-backward algorithm is proposed, which can identify indel errors and provide probability information to the error correction code, enabling error-free data recovery. Additionally, a minimum state transition is maintained, and a read segmentation is incorporated to achieve fast information reading. The readout assays for two circular plasmids (~51 kb) with different coding rates were demonstrated and achieved error-free recovery directly from noisy reads (error rate ~1%) at coverage of 1-4×. Simulations conducted on large-scale datasets across various error rates further confirm the scalability of the method and its robust performance under extreme conditions. 
This readout method enables nearly single-molecule recovery of large DNA, particularly suitable for rapid readout of DNA storage.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11911122/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143647406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting differentially methylated cytosines in TET and DNMT3 knockout mutants via a large language model.","authors":"Saleh Sereshki, Stefano Lonardi","doi":"10.1093/bib/bbaf092","DOIUrl":"10.1093/bib/bbaf092","url":null,"abstract":"<p><p>DNA methylation is an epigenetic marker that directly or indirectly regulates several critical cellular processes. While cytosines in mammalian genomes generally maintain stable methylation patterns over time, other cytosines that belong to specific regulatory regions, such as promoters and enhancers, can exhibit dynamic changes. These changes in methylation are driven by a complex cellular machinery, in which the enzymes DNMT3 and TET play key roles. The objective of this study is to design a machine learning model capable of accurately predicting which cytosines have a fluctuating methylation level [hereafter called differentially methylated cytosines (DMCs)] from the surrounding DNA sequence. Here, we introduce L-MAP, a transformer-based large language model that is trained on DNMT3-knockout and TET-knockout data in human and mouse embryonic stem cells. Our extensive experimental results demonstrate the high accuracy of L-MAP in predicting DMCs. Our experiments also explore whether a classifier trained on human knockout data could predict DMCs in the mouse genome (and vice versa), and whether a classifier trained on DNMT3 knockout data could predict DMCs in TET knockouts (and vice versa). L-MAP enables the identification of sequence motifs associated with the enzymatic activity of DNMT3 and TET, which include known motifs but also novel binding sites that could provide new insights into DNA methylation in stem cells. 
L-MAP is available at https://github.com/ucrbioinfo/dmc_prediction.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11904404/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143623602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MOSim: bulk and single-cell multilayer regulatory network simulator.","authors":"Carolina Monzó, Maider Aguerralde-Martin, Carlos Martínez-Mira, Ángeles Arzalluz-Luque, Ana Conesa, Sonia Tarazona","doi":"10.1093/bib/bbaf110","DOIUrl":"https://doi.org/10.1093/bib/bbaf110","url":null,"abstract":"<p><p>As multi-omics sequencing technologies advance, the need for simulation tools capable of generating realistic and diverse (bulk and single-cell) multi-omics datasets for method testing and benchmarking becomes increasingly important. We present MOSim, an R package that simulates both bulk (via mosim function) and single-cell (via sc_mosim function) multi-omics data. The mosim function generates bulk transcriptomics data (RNA-seq) and additional regulatory omics layers (ATAC-seq, miRNA-seq, ChIP-seq, Methyl-seq, and transcription factors), while sc_mosim simulates single-cell transcriptomics data (scRNA-seq) with scATAC-seq and transcription factors as regulatory layers. The tool supports various experimental designs, including simulation of gene co-expression patterns, biological replicates, and differential expression between conditions. MOSim enables users to generate quantification matrices for each simulated omics data type, capturing the heterogeneity and complexity of bulk and single-cell multi-omics datasets. Furthermore, MOSim provides differentially abundant features within each omics layer and elucidates the active regulatory relationships between regulatory omics and gene expression data at both bulk and single-cell levels. 
By leveraging MOSim, researchers will be able to generate realistic and customizable bulk and single-cell multi-omics datasets to benchmark and validate analytical methods specifically designed for the integrative analysis of diverse regulatory omics data.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143673317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking copy number aberrations inference tools using single-cell multi-omics datasets.","authors":"Minfang Song, Shuai Ma, Gong Wang, Yukun Wang, Zhenzhen Yang, Bin Xie, Tongkun Guo, Xingxu Huang, Liye Zhang","doi":"10.1093/bib/bbaf076","DOIUrl":"10.1093/bib/bbaf076","url":null,"abstract":"<p><p>Copy number alterations (CNAs) are an important type of genomic variation that plays a crucial role in the initiation and progression of cancer. With the explosion of single-cell RNA sequencing (scRNA-seq), several computational methods have been developed to infer CNAs from scRNA-seq studies. However, to date, no independent studies have comprehensively benchmarked their performance. Herein, we evaluated five state-of-the-art methods based on their performance in tumor versus normal cell classification, CNA profile accuracy, tumor subclone inference, and aneuploidy identification in non-malignant cells. Our results showed that Numbat outperformed others across most evaluation criteria, while CopyKAT excelled in scenarios where the expression matrix alone was used as input. In specific tasks, SCEVAN showed the best performance in clonal breakpoint detection and Numbat showed high sensitivity in copy number neutral LOH (cnLOH) detection. Additionally, we investigated how referencing settings, inclusion of tumor microenvironment cells, tumor type, and tumor purity impact the performance of these tools. 
This study provides a valuable guideline for researchers in selecting the appropriate methods for their datasets.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11879432/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143555423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel integrative multimodal classifier to enhance the diagnosis of Parkinson's disease.","authors":"Xiaoyan Zhou, Luca Parisi, Wentao Huang, Yihan Zhang, Xiaoqun Huang, Mansour Youseffi, Farideh Javid, Renfei Ma","doi":"10.1093/bib/bbaf088","DOIUrl":"10.1093/bib/bbaf088","url":null,"abstract":"<p><p>Parkinson's disease (PD) is a complex, progressive neurodegenerative disorder with high heterogeneity, making early diagnosis difficult. Early detection and intervention are crucial for slowing PD progression. Understanding PD's diverse pathways and mechanisms is key to advancing knowledge. Recent advances in noninvasive imaging and multi-omics technologies have provided valuable insights into PD's underlying causes and biological processes. However, integrating these diverse data sources remains challenging, especially when deriving meaningful low-level features that can serve as diagnostic indicators. This study developed and validated a novel integrative, multimodal predictive model for detecting PD based on features derived from multimodal data, including hematological information, proteomics, RNA sequencing, metabolomics, and dopamine transporter scan imaging, sourced from the Parkinson's Progression Markers Initiative. Several model architectures were investigated and evaluated, including support vector machine, eXtreme Gradient Boosting, fully connected neural networks with concatenation and joint modeling (FCNN_C and FCNN_JM), and a multimodal encoder-based model with multi-head cross-attention (MMT_CA). The MMT_CA model demonstrated superior predictive performance, achieving a balanced classification accuracy of 97.7%, thus highlighting its ability to capture and leverage cross-modality inter-dependencies to aid predictive analytics. 
Furthermore, feature importance analysis using SHapley Additive exPlanations not only identified crucial diagnostic biomarkers to inform the predictive models in this study but also holds potential for future research aimed at integrated functional analyses of PD from a multi-omics perspective, ultimately revealing targets required for precision medicine approaches to aid treatment of PD aimed at slowing down its progression.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11891661/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143584798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GEMDiff: a diffusion workflow bridges between normal and tumor gene expression states: a breast cancer case study.","authors":"Xusheng Ai, Melissa C Smith, F Alex Feltus","doi":"10.1093/bib/bbaf093","DOIUrl":"10.1093/bib/bbaf093","url":null,"abstract":"<p><p>Breast cancer remains a significant global health challenge due to its complexity, which arises from multiple genetic and epigenetic mutations that originate in normal breast tissue. Traditional machine learning models often fall short in addressing the intricate gene interactions that complicate drug design and treatment strategies. In contrast, our study introduces GEMDiff, a novel computational workflow leveraging a diffusion model to bridge the gene expression states between normal and tumor conditions. GEMDiff augments RNAseq data and simulates perturbation transformations between normal and tumor gene states, enhancing biomarker identification. GEMDiff can handle large-scale gene expression data without succumbing to the scalability and stability issues that plague other generative models. By avoiding the need for task-specific hyper-parameter tuning and specific loss functions, GEMDiff can be generalized across various tasks, making it a robust tool for gene expression analysis. The model's ability to augment RNA-seq data and simulate gene perturbations provides a valuable tool for researchers. This capability can be used to generate synthetic data for training other machine learning models, thereby addressing the issue of limited biological data and enhancing the performance of predictive models. The effectiveness of GEMDiff is demonstrated through a case study using breast mRNA gene expression data, identifying 307 core genes involved in the transition from a breast tumor to a normal gene expression state. 
GEMDiff is open source and available at https://github.com/xai990/GEMDiff.git under the MIT license.</p>","PeriodicalId":9209,"journal":{"name":"Briefings in bioinformatics","volume":"26 2","pages":""},"PeriodicalIF":6.8,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11894803/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143603123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}