{"title":"Combined Topological Data Analysis and Geometric Deep Learning Reveal Niches by the Quantification of Protein Binding Pockets.","authors":"Peiran Jiang, Jose Lugo-Martinez","doi":"10.1089/cmb.2025.0076","DOIUrl":"https://doi.org/10.1089/cmb.2025.0076","url":null,"abstract":"<p><p>Protein pockets are essential for many proteins to carry out their functions. Locating and measuring protein pockets, as well as studying the anatomy of pockets, helps us further understand protein function. Most research studies focus on learning either local or global information from protein structures. However, there is a lack of studies that leverage the power of integrating both local and global representations of these structures. In this work, we combine topological data analysis (TDA) and geometric deep learning (GDL) to analyze the putative protein pockets of enzymes. TDA captures blueprints of the global topological invariant of protein pockets, whereas GDL decomposes the fingerprints into building blocks of these pockets. This integration of local and global views provides a comprehensive and complementary understanding of the protein structural motifs (<i>niches</i> for short) within protein pockets. We also analyze the distribution of the building blocks making up the pocket and profile the predictive power of coupling local and global representations for the task of discriminating between enzymes and nonenzymes, as well as predicting the enzyme class. 
We demonstrate that our representation learning framework for macromolecules is particularly useful when the structure is known, and the scenarios heavily rely on local and global information.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144174109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effective Integration of Single-Cell Multi-Omics Data Using Improved Network-Based Integrative Clustering with Multigraph Regularization.","authors":"Shunqin Zhang, Wei Kong, Shuaiqun Wang, Kai Wei, Kun Liu, Gen Wen, Yaling Yu","doi":"10.1089/cmb.2023.0460","DOIUrl":"https://doi.org/10.1089/cmb.2023.0460","url":null,"abstract":"<p><p>The purpose of integrating different omics data is to study cellular heterogeneity at the level of transcriptional regulation from different gene levels, which can effectively identify cell types and reveal the pathogenesis of Alzheimer's disease (AD) from two perspectives. However, implementing such algorithms faces challenges such as high data noise levels, increased dimensionality, and computational complexity. In this study, multigraph regularization constraints were introduced in the network-based integrative clustering algorithm (MGR-NIC) to remove redundant features and preserve the geometric structure underlying the data by fusing two types of data (snRNA-seq and snATAC-seq) of glial cells from AD samples. The effectiveness of the MGR-NIC algorithm was validated using both simulated datasets and real datasets derived from various tissues. The MGR-NIC algorithm can improve clustering accuracy by selecting features that better represent the dataset's structure. The clustering results obtained with the MGR-NIC algorithm show strong consistency with the clustering results inherent to the published DLPFC dataset, while the classification results generated using the NIC algorithm often lead to cluster overlap when applied to the DLPFC dataset. For a comprehensive evaluation, we compare the proposed MGR-NIC algorithm with state-of-the-art algorithms, including NIC, scAI, Multi-Omics Factor Analysis v2, and JSNMF. 
MGR-NIC is the most stable and reliable method, implying its robustness across different datasets and its reliability in yielding consistent and accurate results.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144119822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Traditional and Deep Machine Learning to Predict Emergency Room Triage Levels.","authors":"Mehmet Yıldırım, Savaş Sezik, Ayşe Başar","doi":"10.1089/cmb.2024.0632","DOIUrl":"https://doi.org/10.1089/cmb.2024.0632","url":null,"abstract":"<p><p>Accurate triage in emergency rooms is crucial for efficient patient care and resource allocation. We developed methods to predict triage levels using several traditional machine learning methods (logistic regression, random forest, XGBoost) and neural network deep learning-based approaches. These models were tested on a dataset from emergency department visits of patients at a local Turkish hospital; this dataset consists of both structured and unstructured data. Compared with previous work, our challenge was to build a predictive model that uses documents written in the Turkish language and that handles specific aspects of the Turkish medical system. Text embedding techniques such as Bag of Words, Word2Vec, and BERT-based embedding were used to process the unstructured patient complaints. We used a comprehensive set of features including patient history data and disease diagnosis within our predictive models, which included advanced neural network architectures such as convolutional neural networks, attention mechanisms, and long short-term memory networks. Our results revealed that BERT embeddings significantly enhanced the performance of neural network models, while Word2Vec embeddings showed slightly better results in traditional machine learning models. The most effective model was XGBoost combined with Word2Vec embeddings, achieving 86.7% AUC, 81.5% accuracy, and 68.7% weighted F1 score. We conclude that text embedding methods and machine learning methods are effective tools to predict emergency room triage levels. 
The integration of patient history into the models, alongside the strategic use of text embeddings, significantly improves predictive accuracy.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144119823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples.","authors":"Shorya Consul, John Robertson, Haris Vikalo","doi":"10.1089/cmb.2025.0075","DOIUrl":"https://doi.org/10.1089/cmb.2025.0075","url":null,"abstract":"<p><p>It is estimated that approximately 15% of cancers worldwide can be linked to viral infections. The viruses that can cause or increase the risk of cancer include human papillomavirus, hepatitis B and C viruses, Epstein-Barr virus, and human immunodeficiency virus, to name a few. The computational analysis of the massive amounts of tumor DNA data, whose collection is enabled by the advancements in sequencing technologies, has allowed studies of the potential association between cancers and viral pathogens. However, the high diversity of oncoviral families makes reliable detection of viral DNA difficult, and the training of machine learning models that enable such analysis computationally challenging. We introduce XVir, a data pipeline that deploys a transformer-based deep learning architecture to reliably identify viral DNA present in human tumors. XVir is trained on a mix of sequencing reads coming from viral and human genomes, resulting in a model capable of robust detection of potentially mutated viral DNA across a range of experimental settings. Results on semi-experimental data demonstrate that XVir is able to achieve high classification accuracy, generally outperforming state-of-the-art competing methods. 
In particular, it retains high accuracy even when faced with diverse viral populations while being significantly faster to train than other large deep learning-based classifiers.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144110335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparative Performance Evaluation of Large Language Models for Extracting Molecular Interactions and Pathway Knowledge.","authors":"Gilchan Park, Byung-Jun Yoon, Xihaier Luo, Vanessa López-Marrero, Shinjae Yoo, Shantenu Jha","doi":"10.1089/cmb.2025.0078","DOIUrl":"https://doi.org/10.1089/cmb.2025.0078","url":null,"abstract":"<p><p>Understanding the interactions and regulatory relationships among biomolecules is essential for deciphering complex biological systems and elucidating the mechanisms behind diverse biological functions. Traditionally, the collection of such molecular interaction data has relied on expert curation, a process that is both time-consuming and labor-intensive. To address these limitations, this study explores the use of large language models (LLMs) to automate the genome-scale extraction of molecular interaction knowledge. We evaluate the performance of various LLMs on key biological tasks, including the identification of protein-protein interactions, detection of genes associated with pathways influenced by low-dose radiation, and inference of gene regulatory relationships. Our findings demonstrate that larger LLMs tend to perform better, particularly in extracting intricate gene and protein interactions. Despite their strengths, these models face challenges in recognizing functionally diverse gene groups and highly correlated regulatory relationships. 
Through a comprehensive analysis using established molecular interaction and pathway databases, we show that LLMs possess the potential to identify relevant biomolecules and predict their interactions, offering valuable insights and marking a significant step toward AI-driven biological knowledge discovery.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144093858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"<i>Special Section:</i> 12th International Computational Advances in Bio and Medical Sciences (ICCABS 2023).","authors":"Mukul S Bansal, Wei Chen, Yury Khudyakov, Ion I Măndoiu, Marmar R Moussa, Murray Patterson, Sanguthevar Rajasekaran, Pavel Skums, Sharma V Thankachan, Alex Zelikovsky","doi":"10.1089/cmb.2025.0124","DOIUrl":"https://doi.org/10.1089/cmb.2025.0124","url":null,"abstract":"","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144078209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The 2nd International Workshop on Pattern Recognition in Healthcare Analytics 2023 Preface.","authors":"Inci M Baytas","doi":"10.1089/cmb.2025.0117","DOIUrl":"https://doi.org/10.1089/cmb.2025.0117","url":null,"abstract":"","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143985486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the Influence of Gene Networks on Driver Gene Classification.","authors":"Paulo Henrique Ribeiro, Jorge Francisco Cutigi, Rodrigo Henrique Ramos, Cynthia de Oliveira Lage Ferreira, Adriane Feijo Evangelista, Adenilso da Silva Simao","doi":"10.1089/cmb.2025.0043","DOIUrl":"https://doi.org/10.1089/cmb.2025.0043","url":null,"abstract":"<p><p>Cancer is a complex disease caused by mutations in the genome of cells. Genetic mutations can be divided into driver mutations, which are significant for the initiation and progression of cancer, and passenger mutations, which have a neutral effect. In recent years, computational methods have been developed to identify driver genes. Some of these methods use data from gene networks to classify the genes. However, the impact of different gene networks on the performance of these methods remains unexplored. This article aims to analyze the influence of genetic networks in driver gene classification. We analyzed driver gene classification methods that use gene networks as input data, using different cancer mutation datasets and distinct gene networks. Computational methods show significant variation in their results when different gene networks are employed. The results highlight the need to carefully interpret driver gene classification and emphasize the importance of using different gene networks. 
These findings underline the necessity of developing more robust computational approaches that account for network variability, ensuring greater reliability in driver gene identification and its applications in cancer research.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143994363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DrIVeNN: Drug Interaction Vectors Neural Network.","authors":"Natalie Wang, Casey Overby Taylor","doi":"10.1089/cmb.2025.0079","DOIUrl":"https://doi.org/10.1089/cmb.2025.0079","url":null,"abstract":"<p><p>Polypharmacy, the concurrent use of multiple drugs to treat a single condition, is common in patients managing multiple or complex conditions. However, as more drugs are added to the treatment plan, the risk of adverse drug events (ADEs) rises rapidly. Because it is impractical to test every possible drug combination during clinical trials, many serious polypharmacy ADEs (also known as drug-drug interactions or DDIs) only become known after the drugs are in use. This issue is prevalent among older adults with cardiovascular disease (CVD), where polypharmacy and ADEs are common. In this research, our primary objective was to identify key drug features and build and evaluate a model to predict DDIs. Our secondary objective was to assess our model on a domain-specific case study. We developed a two-layer neural network that incorporated drug features such as molecular structure, drug-protein interactions, and mono-drug side effects (drug interaction vectors neural network [DrIVeNN]) using publicly available side effect databases. It performed moderately better than state-of-the-art models such as DGNN-DDI, KGDDI, and NNPS. DrIVeNN had average area under the Receiver Operating Characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) scores of 0.934 and 0.920, respectively, compared to the best-performing baseline model, DGNN-DDI, which had scores of 0.919 and 0.904. We also conducted a domain-specific case study centered on CVD treatment, and there was a significant increase in performance from the general model. We observed an average AUROC for CVD DDI prediction of 0.979. 
This research contributes to the advancement of predictive modeling techniques for polypharmacy ADEs and indicates the strong potential of domain-specific models.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144027348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generating Heterogeneous Data on Gene Trees.","authors":"Martí Cortada Garcia, Adrià Diéguez Moscardó, Marta Casanellas","doi":"10.1089/cmb.2024.0843","DOIUrl":"https://doi.org/10.1089/cmb.2024.0843","url":null,"abstract":"<p><p>We introduce GenPhylo, a Python module that simulates nucleotide sequence data along a phylogeny without the restriction to continuous-time Markov processes. GenPhylo directly uses a general Markov model and therefore naturally incorporates heterogeneity across lineages. We solve the challenge of generating transition matrices with a prespecified expected number of substitutions (the branch length information) by providing an algorithm that can be incorporated in other simulation software.</p>","PeriodicalId":15526,"journal":{"name":"Journal of Computational Biology","volume":" ","pages":""},"PeriodicalIF":1.4,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143972956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}