{"title":"Pool PaRTI: A PageRank-Based Pooling Method for Identifying Critical Residues and Enhancing Protein Sequence Representations.","authors":"Alp Tartici, Gowri Nayar, Russ B Altman","doi":"10.1093/bioinformatics/btaf330","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf330","url":null,"abstract":"<p><strong>Motivation: </strong>Protein language models produce token-level embeddings for each residue, resulting in an output matrix with dimensions that vary based on sequence length. However, downstream machine learning models typically require fixed-length input vectors, necessitating a pooling method to compress the output matrix into a single vector representation of the entire protein. Traditional pooling methods often result in substantial information loss, impacting downstream task performance. We aim to develop a pooling method that produces more expressive general-purpose protein embedding vectors while offering biological interpretability.</p><p><strong>Results: </strong>We introduce Pool PaRTI, a novel pooling method that leverages internal transformer attention matrices and PageRank to assign token importance weights. Our unsupervised and parameter-free approach consistently prioritizes residues experimentally annotated as critical for function, assigning them higher importance scores. Across four diverse protein machine learning tasks, Pool PaRTI enables significant gains in predictive performance. Additionally, it enhances interpretability by identifying biologically relevant regions without relying on explicit structural data or annotated training. 
To assess generalizability, we evaluated Pool PaRTI with two encoder-only protein language models, confirming its robustness across different models.</p><p><strong>Availability and implementation: </strong>Pool PaRTI is implemented in Python with PyTorch and is available at github.com/Helix-Research-Lab/Pool_PaRTI.git.</p><p><strong>Contact and supplementary information: </strong>The Pool PaRTI sequence embeddings and residue importance values for all human proteins on UniProt are available at zenodo.org/records/15036725 for ESM2 and protBERT. You can contact the lead author for further questions.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-06-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144201045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DiSC: a Statistical Tool for Fast Differential Expression Analysis of Individual-level Single-cell RNA-seq Data.","authors":"Lujun Zhang, Lu Yang, Yingxue Ren, Shuwen Zhang, Weihua Guan, Jun Chen","doi":"10.1093/bioinformatics/btaf327","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf327","url":null,"abstract":"<p><strong>Motivation: </strong>Single-cell RNA sequencing (scRNA-seq) has become an important method for characterizing cellular heterogeneity, revealing more biological insights than bulk RNA-seq. The surge in scRNA-seq data across multiple individuals calls for efficient and statistically powerful methods for differential expression (DE) analysis that address individual-level biological variability.</p><p><strong>Results: </strong>We introduced DiSC, a method for conducting individual-level DE analysis by extracting multiple distributional characteristics, jointly testing their association with a variable of interest, and using a flexible permutation testing framework to control the false discovery rate (FDR). Our simulation studies demonstrated that DiSC effectively controlled the FDR across various settings and exhibited high statistical power in detecting different types of gene expression changes. Moreover, DiSC is computationally efficient and scalable to the rapidly increasing sample sizes in scRNA-seq studies. When applying DiSC to identify DE genes potentially associated with COVID-19 severity and Alzheimer's disease across various types of peripheral blood mononuclear cells and neural cells, we found that our method was approximately 100 times faster than other state-of-the-art methods and the results were consistent and supported by existing literature. While DiSC was developed for scRNA-seq data, its robust testing framework can also be applied to other types of single-cell data. 
When applied to cytometry by time-of-flight data, DiSC identified significantly more DE markers than traditional methods.</p><p><strong>Availability: </strong>The R software package \"SingleCellStat\" is freely available on CRAN (https://cran.r-project.org/web/packages/SingleCellStat/index.html) and GitHub (https://github.com/Lujun995/DiSC). The replication code for reproducing the analyses in this study is publicly accessible at https://github.com/Lujun995/DiSC_Replication_Code.</p><p><strong>Supplementary information: </strong>The scRNA-seq expression matrix and metadata utilized in our simulations and analyses can be retrieved from https://cells.ucsc.edu/autism/rawMatrix.zip, https://cellxgene.cziscience.com/collections/1ca90a2d-2943-483d-b678-b809bf464c30, and https://covid19.cog.sanger.ac.uk/submissions/release1/haniffa21.processed.h5ad. Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144188675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Brownian motion data augmentation: a method to push neural network performance on nanopore sensors.","authors":"Javier Kipen, Joakim Jaldén","doi":"10.1093/bioinformatics/btaf323","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf323","url":null,"abstract":"<p><strong>Motivation: </strong>Nanopores are highly sensitive sensors that have achieved commercial success in DNA/RNA sequencing, with potential applications in protein sequencing and biomarker identification. Solid-state nanopores, in particular, face challenges such as instability and low signal-to-noise ratios (SNRs), which lead scientists to adopt data-driven methods for nanopore signal analysis, although data acquisition remains restrictive.</p><p><strong>Results: </strong>We address this data scarcity by augmenting the training samples with traces that emulate Brownian motion effects, based on dynamic models in the literature. We apply this method to a publicly available dataset of a classification task containing nanopore reads of DNA with encoded barcodes. A neural network named QuipuNet was previously published for this dataset, and we demonstrate that our augmentation method produces a noticeable increase in QuipuNet's accuracy. Furthermore, we introduce a novel neural network named YupanaNet, which achieves greater accuracy (95.8%) than QuipuNet (94.6%) on the same dataset. 
YupanaNet benefits from both the enhanced generalization provided by Brownian motion data augmentation and the incorporation of novel architectures, including skip connections and a soft attention mask.</p><p><strong>Availability and implementation: </strong>The source code and data are available at: https://github.com/JavierKipen/browDataAug.</p><p><strong>Supplementary information: </strong>Supplementary information is available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144174607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nearl: Extracting dynamic features from molecular dynamics trajectories for machine learning tasks.","authors":"Yang Zhang, Andreas Vitalis","doi":"10.1093/bioinformatics/btaf321","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf321","url":null,"abstract":"<p><strong>Summary: </strong>Despite the rapid growth of machine learning in biomolecular applications, information about protein dynamics is underutilized. Here, we introduce Nearl, an automated pipeline designed to extract dynamic features from large ensembles of molecular dynamics (MD) trajectories. Nearl aims to identify intrinsic patterns of molecular motion and to provide informative features for predictive modelling tasks. We implement two classes of dynamic features, termed marching observers and property-density flow, to capture local atomic motions while maintaining a view of the global configuration. Complemented by standard voxelization techniques, Nearl transforms substructures of proteins into 3D grids, suitable for contemporary 3D convolutional neural networks (3D-CNNs). The pipeline leverages GPU acceleration, adheres to the FAIR principles for research software, and prioritizes flexibility and user-friendliness, allowing customization of input formats and feature extraction.</p><p><strong>Availability and implementation: </strong>The source code of Nearl is hosted at https://github.com/miemiemmmm/Nearl and archived at https://doi.org/10.5281/zenodo.15320286. The documentation is hosted on ReadTheDocs at https://nearl.readthedocs.io/en/latest/. 
All pre-built models are implemented in PyTorch and available on GitHub.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144175276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HSDSnake: a user-friendly SnakeMake pipeline for analysis of duplicate genes in eukaryotic genomes.","authors":"Xi Zhang, Yining Hu, David Roy Smith, Zhenyu Cheng, John M Archibald","doi":"10.1093/bioinformatics/btaf325","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf325","url":null,"abstract":"<p><strong>Summary: </strong>Gene duplication is a well-known driver of molecular evolution: it acts as a source of genetic novelty, thereby providing the raw substrate for organismal adaptation. However, detecting different types of gene duplicates and comparing them in sequence datasets can be difficult. Existing tools can identify and classify gene duplicates that have arisen by various processes, but have limitations; for example, some do not have a user-friendly workflow and can include many intermediate steps requiring manual adjustments of parameters and/or are not maintained for the benefit of research community members. Here, we have developed HSDSnake, a user-friendly SnakeMake pipeline that can detect and classify gene duplications into five categories: dispersed, proximal, tandem, transposed, and whole genome. It also curates and evaluates the highly similar gene duplicates (HSDs) in each gene duplication category with reliance on both sequence similarity and conserved domains. Lastly, the detected gene duplicates can be visualized within a KEGG functional pathway framework and the substitution rates (Ka, Ks, and their Ka/Ks ratio) can be analyzed for all the duplicate gene pairs. We demonstrate HSDSnake's capabilities by analyzing two reference genomes directly downloaded from NCBI and provide detailed instructions for each step.</p><p><strong>Availability and implementation: </strong>The HSDSnake pipeline uses SnakeMake and Conda to run and install dependencies. 
The distribution version is available online at GitHub: https://github.com/zx0223winner/HSDSnake and the archived version at Zenodo is https://doi.org/10.5281/zenodo.15521945.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online and at https://github.com/zx0223winner/HSDSnake/blob/main/docs/Usage.md.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144174898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Residue conservation and solvent accessibility are (almost) all you need for predicting mutational effects in proteins.","authors":"Matsvei Tsishyn, Pauline Hermans, Fabrizio Pucci, Marianne Rooman","doi":"10.1093/bioinformatics/btaf322","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf322","url":null,"abstract":"<p><strong>Motivation: </strong>Predicting how mutations impact protein biophysical properties remains a significant challenge in computational biology. In recent years, numerous predictors, primarily deep learning models, have been developed to address this problem; however, issues such as their lack of interpretability and limited accuracy persist.</p><p><strong>Results: </strong>We showed that a simple evolutionary score, based on the log-odds ratio (LOR) of wild-type and mutated residue frequencies in evolutionarily related proteins, when scaled by the residue's relative solvent accessibility (RSA), performs on par with or slightly outperforms most of the benchmarked predictors, many of which are considerably more complex. The evaluation is performed on mutations from the ProteinGym deep mutational scanning dataset collection, which measures various properties such as stability, activity or fitness. This raises further questions about what these complex models actually learn and highlights their limitations in addressing prediction of mutational landscape.</p><p><strong>Availability: </strong>The RSALOR model is available as a user-friendly Python package that can be installed from the PyPI repository. 
The code is freely available at https://github.com/3BioCompBio/RSALOR.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144175578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Deep Learning-based Method for Predicting the Frequency Classes of Drug Side Effects Based on Multi-Source Similarity Fusion.","authors":"Haochen Zhao, Dingxi Li, Jian Zhong, Xiao Liang, Guihua Duan, Jianxin Wang","doi":"10.1093/bioinformatics/btaf319","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf319","url":null,"abstract":"<p><strong>Motivation: </strong>Drug side effects refer to harmful or adverse reactions that occur during drug use, unrelated to the therapeutic purpose. A core issue in drug side effect prediction is determining the frequency of these drug side effects in the population, which can guide patient medication use and drug development. Many computational methods have been developed to predict the frequency of drug side effects as an alternative to clinical trials. However, existing methods typically build regression models on five frequency classes of drug side effects, which leads to boundary handling issues and a risk of overfitting the training set.</p><p><strong>Results: </strong>To address this problem, we develop a multi-source similarity fusion-based model, named MSSF, for predicting five frequency classes of drug side effects. Compared to existing methods, our model utilizes the multi-source feature fusion module and the self-attention mechanism to explore the relationships between drugs and side effects deeply and employs Bayesian variational inference to more accurately predict the frequency classes of drug side effects. The experimental results indicate that MSSF consistently achieves superior performance compared to existing models across multiple evaluation settings, including cross-validation, cold-start experiments, and independent testing. 
The visual analysis and case studies further demonstrate MSSF's reliable feature extraction capability and promise in predicting the frequency classes of drug side effects.</p><p><strong>Availability: </strong>The source code of MSSF is available on GitHub (https://github.com/dingxlcse/MSSF.git) and archived on Zenodo (DOI: 10.5281/zenodo.15462041).</p><p><strong>Supplementary information: </strong>Additional files are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144163296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"scNucMap: mapping the nucleosome landscapes at single-cell resolution.","authors":"Qianming Xiang, Binbin Lai","doi":"10.1093/bioinformatics/btaf324","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf324","url":null,"abstract":"<p><strong>Motivation: </strong>Nucleosome depletion around cis-regulatory elements (CREs) is associated with CRE activity and implies the underlying gene regulatory network. Single-cell micrococcal nuclease sequencing (scMNase-seq) enables the simultaneous measurement of nucleosome positioning and chromatin accessibility at single-cell resolution, thereby capturing cellular heterogeneity in epigenetic regulation. However, there is currently no computational tool specifically designed to decode scMNase-seq data, impeding the generation of more precise and context-dependent insights into chromatin dynamics and gene regulation.</p><p><strong>Results: </strong>Here, we present scNucMap, a tool designed to leverage the unique characteristics of scMNase-seq data to map the landscapes of candidate nucleosome-free regions (NFRs). scNucMap demonstrated superior performance and robustness in cell clustering on scMNase-seq data compared to Signac and chromVAR across diverse sample compositions and data complexities, achieving higher overall accuracy and Kappa coefficients. Additionally, scNucMap identified significant TFs associated with nucleosome depletion at CREs at both single-cell and cell-cluster levels, thereby facilitating cell-type annotation and regulatory network inference. When applied to scATAC-seq, scNucMap enriched standard analyses with complementary insights into nucleosome architecture, underscoring its cross‑modality versatility. 
Overall, scNucMap exhibits both high reliability and adaptability, making it an effective tool for analyzing scMNase-seq data and supporting multimodal studies, thereby illuminating the intricate relationship between regulatory networks and nucleosome positioning at single-cell resolution.</p><p><strong>Availability and implementation: </strong>scNucMap is available at https://github.com/qianming-bioinfo/scNucMap.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144163853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simple controls exceed best deep learning algorithms and reveal foundation model effectiveness for predicting genetic perturbations.","authors":"Daniel R Wong, Abby S Hill, Rob Moccia","doi":"10.1093/bioinformatics/btaf317","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf317","url":null,"abstract":"<p><strong>Motivation: </strong>Modeling genetic perturbations and their effect on the transcriptome is a key area of pharmaceutical research. Due to the complexity of the transcriptome, there has been much excitement and development in deep learning (DL) because of its ability to model complex relationships. In particular, the transformer-based foundation model paradigm emerged as the gold standard for predicting post-perturbation responses. However, understanding these increasingly complex models and evaluating their practical utility is lacking, along with simple but appropriate benchmarks to compare predictive methods.</p><p><strong>Results: </strong>Here, we present a simple baseline method that outperforms both the state of the art (SOTA) in DL and other proposed simpler neural architectures, setting a necessary benchmark to evaluate in the field of post-perturbation prediction. We also elucidate the utility of foundation models for the task of post-perturbation prediction via generalizable fine-tuning experiments that can be translated to different applications of transformer-based foundation models to tasks of interest. Furthermore, we provide a corrected version of a popular dataset used for benchmarking perturbation prediction models. Our hope is that this work will properly contextualize further development of DL models in the perturbation space with necessary control procedures.</p><p><strong>Availability and implementation: </strong>All source code is available at: https://github.com/pfizer-opensource/perturb_seq. 
The DOI is 10.5281/zenodo.15352937.</p><p><strong>Contact: </strong>daniel.wong@pfizer.com.</p><p><strong>Supplementary information: </strong>Supplementary data are available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144129810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TRENDY: Gene Regulatory Network Inference Enhanced by Transformer.","authors":"Xueying Tian, Yash Patel, Yue Wang","doi":"10.1093/bioinformatics/btaf314","DOIUrl":"https://doi.org/10.1093/bioinformatics/btaf314","url":null,"abstract":"<p><strong>Motivation: </strong>Gene regulatory networks (GRNs) play a crucial role in the control of cellular functions. Numerous methods have been developed to infer GRNs from gene expression data, including mechanism-based approaches, information-based approaches, and more recent deep learning techniques, the last of which often overlook the underlying gene expression mechanisms.</p><p><strong>Results: </strong>In this work, we introduce TRENDY, a novel GRN inference method that integrates transformer models to enhance the mechanism-based WENDY approach. Through testing on both simulated and experimental datasets, TRENDY demonstrates superior performance compared to existing methods. Furthermore, we apply this transformer-based approach to three additional inference methods, showcasing its broad potential to enhance GRN inference.</p><p><strong>Availability and implementation: </strong>Code and data files are available at https://github.com/YueWangMathbio/TRENDY, with DOI: 10.6084/m9.figshare.28236074.</p><p><strong>Supplementary information: </strong>Supplementary material is available at Bioinformatics online.</p>","PeriodicalId":93899,"journal":{"name":"Bioinformatics (Oxford, England)","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144132950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}