Silvia Cascianelli, Iva Milojkovic, Marco Masseroli
{"title":"A novel machine learning-based workflow to capture intra-patient heterogeneity through transcriptional multi-label characterization and clinically relevant classification","authors":"Silvia Cascianelli, Iva Milojkovic, Marco Masseroli","doi":"10.1016/j.jbi.2025.104817","DOIUrl":"10.1016/j.jbi.2025.104817","url":null,"abstract":"<div><h3>Objectives:</h3><div>Patient classification into specific molecular subtypes is paramount in biomedical research and clinical practice to face complex, heterogeneous diseases. Existing methods, especially for gene expression-based cancer subtyping, often simplify patient molecular portraits, neglecting the potential co-occurrence of traits from multiple subtypes. Yet, recognizing intra-sample heterogeneity is essential for more precise patient characterization and improved personalized treatments.</div></div><div><h3>Methods:</h3><div>We developed a novel computational workflow, named MULTI-STAR, which addresses current limitations and provides tailored solutions for reliable multi-label patient subtyping. MULTI-STAR uses state-of-the-art subtyping methods to obtain promising machine learning-based multi-label classifiers, leveraging gene expression profiles. It modifies standard single-label similarity-based techniques to obtain multi-label patient characterizations. Then, it employs these characterizations to train single-sample predictors using different multi-label strategies and find the best-performing classifiers.</div></div><div><h3>Results:</h3><div>MULTI-STAR classifiers offer advanced multi-label recognition of all the subtypes contributing to the molecular and clinical traits of a patient, also distinguishing the primary from the additional relevant secondary subtype(s). 
The efficacy was demonstrated by developing multi-label solutions for breast and colorectal cancer subtyping that outperform existing methods in terms of prognostic value, primarily for overall survival predictions, and ability to work on a single sample at a time, as required in clinical practice.</div></div><div><h3>Conclusions:</h3><div>This work emphasizes the importance of moving to multi-label subtyping to capture all the molecular traits of individual patients, considering also previously overlooked secondary assignments and paving the way for improved clinical decision-making processes in diverse heterogeneous disease contexts. Indeed, the novel, reproducible and generalizable MULTI-STAR approach provides comprehensive representations of patients’ inner heterogeneity and clinically relevant insights, contributing to precision medicine and personalized treatments.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"166 ","pages":"Article 104817"},"PeriodicalIF":4.0,"publicationDate":"2025-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143816805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Francisco J. Lara-Abelenda , David Chushig-Muzo , Pablo Peiro-Corbacho , Vanesa Gómez-Martínez , Ana M. Wägner , Conceição Granja , Cristina Soguero-Ruiz
{"title":"Transfer learning for a tabular-to-image approach: A case study for cardiovascular disease prediction","authors":"Francisco J. Lara-Abelenda , David Chushig-Muzo , Pablo Peiro-Corbacho , Vanesa Gómez-Martínez , Ana M. Wägner , Conceição Granja , Cristina Soguero-Ruiz","doi":"10.1016/j.jbi.2025.104821","DOIUrl":"10.1016/j.jbi.2025.104821","url":null,"abstract":"<div><h3>Objective:</h3><div>Machine learning (ML) models have been extensively used for tabular data classification, but recent works have been developed to transform tabular data into images, aiming to leverage the predictive performance of convolutional neural networks (CNNs). However, most of these approaches fail to convert data with a low number of samples and mixed-type features. This study aims to evaluate the performance of the tabular-to-image method named low mixed-image generator for tabular data (LM-IGTD), and to assess the effectiveness of transfer learning and fine-tuning for improving predictions on tabular data.</div></div><div><h3>Methods:</h3><div>We employed two public tabular datasets with patients diagnosed with cardiovascular diseases (CVDs): Framingham and Steno. First, both datasets were transformed into images using LM-IGTD. Then, Framingham, which contains a larger set of samples than Steno, was used to train CNN-based models. Finally, we performed transfer learning and fine-tuning using the pre-trained CNN on the Steno dataset to predict CVD risk.</div></div><div><h3>Results:</h3><div>The CNN-based model with transfer learning achieved the highest AUROC in Steno (0.855), outperforming ML models such as decision trees, K-nearest neighbours, least absolute shrinkage and selection operator (LASSO) support vector machine and TabPFN. 
This approach improved accuracy by 2% over the best-performing traditional model, TabPFN.</div></div><div><h3>Conclusion:</h3><div>To the best of our knowledge, this is the first study that evaluates the effectiveness of applying transfer learning and fine-tuning to tabular data using tabular-to-image approaches. Through the use of CNNs’ predictive capabilities, our work also advances the diagnosis of CVD by providing a framework for early clinical intervention and decision-making support.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"165 ","pages":"Article 104821"},"PeriodicalIF":4.0,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143799192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Moxuan Ma , Muyu Wang , Lan Wei , Xiaolu Fei , Hui Chen
{"title":"Multi-modal fusion model for Time-Varying medical Data: Addressing Long-Term dependencies and memory challenges in sequence fusion","authors":"Moxuan Ma , Muyu Wang , Lan Wei , Xiaolu Fei , Hui Chen","doi":"10.1016/j.jbi.2025.104823","DOIUrl":"10.1016/j.jbi.2025.104823","url":null,"abstract":"<div><h3>Background</h3><div>Multi-modal time-varying data continuously generated during a patient’s hospitalization reflects the patient’s disease progression. Certain patient conditions may be associated with long-term states, which is a weakness of current medical multi-modal time-varying data fusion models. Daily ward round notes, as time-series long texts, are often neglected by models.</div></div><div><h3>Objective</h3><div>This study aims to develop an effective medical multi-modal time-varying data fusion model capable of extracting features from long sequences and long texts while capturing long-term dependencies.</div></div><div><h3>Methods</h3><div>We proposed a model called medical multi-modal fusion for long-term dependencies (MMF-LD) that fuses time-varying and time-invariant, tabular, and textual data. A progressive multi-modal fusion (PMF) strategy was introduced to address information loss in multi-modal time series fusion, particularly for long time-varying texts. With the integration of the attention mechanism, the long short-term storage memory (LSTsM) gained enhanced capacity to extract long-term dependencies. In conjunction with the temporal convolutional network (TCN), it extracted long-term features from time-varying sequences without neglecting the local contextual information of the time series. Model performance was evaluated on acute myocardial infarction (AMI) and stroke datasets for in-hospital mortality risk prediction and long length-of-stay prediction. 
Area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and F1 score were used as evaluation metrics for model performance.</div></div><div><h3>Results</h3><div>The MMF-LD model demonstrated superior performance compared to other multi-modal time-varying data fusion models in model comparison experiments (AUROC: 0.947 and 0.918 in the AMI dataset, and 0.965 and 0.868 in the stroke dataset; AUPRC: 0.410 and 0.675, and 0.467 and 0.533; F1 score: 0.658 and 0.513, and 0.684 and 0.401). Ablation experiments confirmed that the proposed PMF strategy, LSTsM, and TCN modules all contributed to performance improvements as intended.</div></div><div><h3>Conclusions</h3><div>The proposed medical multi-modal time-varying data fusion architecture addresses the challenge of forgetting time-varying long textual information in time series fusion. It exhibits strength in capturing long-term dependencies and shows stable performance across multiple datasets and tasks.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"165 ","pages":"Article 104823"},"PeriodicalIF":4.0,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143792482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kory Kreimeyer , Jonathan Spiker , Oanh Dang , Suranjan De , Robert Ball , Taxiarchis Botsis
{"title":"Deduplicating the FDA adverse event reporting system with a novel application of network-based grouping","authors":"Kory Kreimeyer , Jonathan Spiker , Oanh Dang , Suranjan De , Robert Ball , Taxiarchis Botsis","doi":"10.1016/j.jbi.2025.104824","DOIUrl":"10.1016/j.jbi.2025.104824","url":null,"abstract":"<div><h3>Objective</h3><div>To improve the reliability of data mining for product safety concerns in the Food and Drug Administration’s (FDA) Adverse Event Reporting System (FAERS) by robustly identifying duplicate reports describing the same patient experience.</div></div><div><h3>Materials and methods</h3><div>A duplicate detection algorithm based on a probabilistic record linkage algorithm, including features extracted from report narratives, and designed to support FAERS case safety review as part of the Information Visualization Platform (InfoViP) has been upgraded into a full deduplication pipeline for the entire FAERS database. The pipeline contains several new and updated components, including a network analysis-based community detection routine for breaking up sparsely connected groups of duplicates constructed from chains of pairwise comparisons. The pipeline was applied to all 29 million FAERS reports to assemble groups of duplicate cases.</div></div><div><h3>Results</h3><div>The pipeline was evaluated on 12 human expert adjudicated data sets with a total of 2300 reports and was found to have better overall performance than the current tool used at the FDA for labeling duplicates on 10 of them, with F1 scores ranging from 0.36 to 0.93, with half above 0.75. 
Because minimizing false discovery increases human expert review efficiency, the improved deduplication pipeline was applied to all historic and daily incoming FAERS reports at FDA and identified about 5 million reports as duplicates.</div></div><div><h3>Conclusions</h3><div>The InfoViP deduplication pipeline is operating at FDA to identify duplicate case reports in FAERS and provide deduplicated input for improved efficiency and accuracy of safety review operations like adverse event data mining calculations.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"165 ","pages":"Article 104824"},"PeriodicalIF":4.0,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143777390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haihua Chen , Ruochi Li , Ana Cleveland , Junhua Ding
{"title":"Enhancing data quality in medical concept normalization through large language models","authors":"Haihua Chen , Ruochi Li , Ana Cleveland , Junhua Ding","doi":"10.1016/j.jbi.2025.104812","DOIUrl":"10.1016/j.jbi.2025.104812","url":null,"abstract":"<div><h3>Objective:</h3><div>Medical concept normalization (MCN) aims to map informal medical terms to formal medical concepts, a critical task in building machine learning systems for medical applications. However, most existing studies on MCN primarily focus on models and algorithms, often overlooking the vital role of data quality. This research evaluates MCN performance across varying data quality scenarios and investigates how to leverage these evaluation results to enhance data quality, ultimately improving MCN performance through the use of large language models (LLMs). The effectiveness of the proposed approach is demonstrated through a case study.</div></div><div><h3>Methods:</h3><div>We begin by conducting a data quality evaluation of a dataset used for MCN. Based on these findings, we employ ChatGPT-based zero-shot prompting for data augmentation. The quality of the generated data is then assessed across the dimensions of correctness and comprehensiveness. A series of experiments is performed to analyze the impact of data quality on MCN model performance. These results guide us in implementing LLM-based few-shot prompting to further enhance data quality and improve model performance.</div></div><div><h3>Results:</h3><div>Duplication of data items within a dataset can lead to inaccurate evaluation results. Data augmentation techniques such as zero-shot and few-shot learning with ChatGPT can introduce duplicated data items, particularly those in the mean region of a dataset’s distribution. As such, data augmentation strategies must be carefully designed, incorporating context information and training data to avoid these issues. 
Additionally, we found that including augmented data in the testing set is necessary to fairly evaluate the effectiveness of data augmentation strategies.</div></div><div><h3>Conclusion:</h3><div>While LLMs can generate high-quality data for MCN, the success of data augmentation depends heavily on the strategy employed. Our study found that few-shot learning, with prompts that incorporate appropriate context and a small, representative set of original data, is an effective approach. The methods developed in this research, including the data quality evaluation framework, LLM-based data augmentation strategies, and procedures for data quality enhancement, provide valuable insights for data augmentation and evaluation in similar deep learning applications.</div></div><div><h3>Availability:</h3><div><span><span>https://github.com/RichardLRC/mcn-data-quality-llm/tree/main/evaluation</span></span></div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"165 ","pages":"Article 104812"},"PeriodicalIF":4.0,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143777389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Manqi Zhou , Alice S. Tang , Hao Zhang , Zhenxing Xu , Alison M.C. Ke , Chang Su , Yu Huang , William G. Mantyh , Michael S. Jaffee , Katherine P. Rankin , Steven T. DeKosky , Jiayu Zhou , Yi Guo , Jiang Bian , Marina Sirota , Fei Wang
{"title":"Identifying progression subphenotypes of Alzheimer’s disease from large-scale electronic health records with machine learning","authors":"Manqi Zhou , Alice S. Tang , Hao Zhang , Zhenxing Xu , Alison M.C. Ke , Chang Su , Yu Huang , William G. Mantyh , Michael S. Jaffee , Katherine P. Rankin , Steven T. DeKosky , Jiayu Zhou , Yi Guo , Jiang Bian , Marina Sirota , Fei Wang","doi":"10.1016/j.jbi.2025.104820","DOIUrl":"10.1016/j.jbi.2025.104820","url":null,"abstract":"<div><h3>Objective</h3><div>Identification of clinically meaningful subphenotypes of disease progression can enhance the understanding of disease heterogeneity and underlying pathophysiology. In this study, we propose a machine learning framework to identify subphenotypes of Alzheimer’s disease progression based on longitudinal real-world patient records.</div></div><div><h3>Methods</h3><div>The framework, dynaPhenoM, extracts coherent clinical topics across patient visits and employs a time-aware latent class analysis to characterize subphenotypes. We validated dynaPhenoM using three patient databases with a total of 3952 AD patients across the United States, demonstrating its effectiveness in revealing mild cognitive impairment (MCI) progression to AD.</div></div><div><h3>Results</h3><div>Our study identified five subphenotypes associated with distinct organ systems for disease progression from MCI to AD, including common subtypes across cohorts—respiratory, musculoskeletal, cardiovascular, and endocrine/metabolic—as well as a cohort-specific digestive subtype.</div></div><div><h3>Conclusion</h3><div>Our study unravels the complexity and heterogeneity of the progression from MCI to AD. 
These findings highlight disease progression heterogeneity and can inform both diagnostic and therapeutic strategies, thereby advancing precision medicine for Alzheimer’s disease.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"165 ","pages":"Article 104820"},"PeriodicalIF":4.0,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143777391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alissa L. Russ-Jara , Jason J. Saleem , Jennifer Herout
{"title":"A practical guide to usability questionnaires that evaluate clinicians’ perceptions of health information technology","authors":"Alissa L. Russ-Jara , Jason J. Saleem , Jennifer Herout","doi":"10.1016/j.jbi.2025.104822","DOIUrl":"10.1016/j.jbi.2025.104822","url":null,"abstract":"<div><h3>Objective</h3><div>Numerous usability questionnaires are available to evaluate the usability of health information technology (IT). It can be difficult for practitioners to determine which questionnaire most closely aligns with their health IT evaluation goals. Our objective was to develop a practical guide to enable practitioners to select an appropriate usability questionnaire for their health IT evaluation.</div></div><div><h3>Methods</h3><div>Questionnaires were identified from the literature and input from usability experts. Inclusion criteria included: 1) post-test or post-task usability questionnaire; 2) demonstrated validity, with good internal reliability (Cronbach α ≥ 0.70); 3) freely available for use; 4) applicable across a wide range of health IT products; and 5) demonstrated use with health IT in peer-reviewed literature, even if not originally designed for healthcare.</div></div><div><h3>Results</h3><div>Criteria were met by seven usability questionnaires. Results include a synopsis of each usability questionnaire along with a matrix to visually compare methodological characteristics across questionnaires. Additionally, results include an analysis of distinguishing methodological strengths and limitations that set each usability questionnaire apart. For each questionnaire, we also outline considerations for use when evaluating health IT.</div></div><div><h3>Conclusion</h3><div>This novel, practical guide provides an important methodological analysis of currently available usability questionnaires for health IT evaluation. 
This article can help practitioners make a more efficient, but also well-informed, choice when selecting a usability questionnaire for health IT evaluation. This practical, methodological guide applies to a wide range of health IT products, including electronic health records (EHRs).</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"165 ","pages":"Article 104822"},"PeriodicalIF":4.0,"publicationDate":"2025-03-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143772547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eric Ni , Elizabeth Knight , Mark Gerstein
{"title":"Scalable and efficient on-chain data management in blockchain for large biomedical data","authors":"Eric Ni , Elizabeth Knight , Mark Gerstein","doi":"10.1016/j.jbi.2025.104818","DOIUrl":"10.1016/j.jbi.2025.104818","url":null,"abstract":"<div><div>Blockchain technology is gaining traction in the biomedical sector due to its ability to improve trust and reduce the risk of fraud and errors in health data management. However, the large volume of biomedical datasets has slowed its adoption due to poor scalability. This challenge is especially relevant for applications that rely on blockchain’s strong immutability by storing data directly on-chain. In this work, we demonstrate the potential of blockchain to create a secure and trustless environment for managing large on-chain records. Specifically, we detail an efficient, index-based approach for storing data on the Ethereum blockchain. We show that insertion and retrieval speeds remain nearly constant relative to database size, scaling linearly with the amount of data processed. Additionally, we achieve substantial efficiency gains through low-level assembly optimizations on the Ethereum Virtual Machine, highlighting the limitations of the Solidity compiler. Finally, we illustrate this approach through a practical case study, by designing and implementing a smart contract for storing and querying training certificates on the Ethereum blockchain. Our solution achieves 2x faster data insertion, 500x faster retrieval, 60% lower gas costs, and 50% lower storage usage compared to baseline methods. It won first place for track 1 of the 2022 iDASH secure genome analysis competition. 
We also demonstrate that this solution readily adapts to other data types, enabling efficient on-chain storage and retrieval of text, RNA-seq, or biomedical image data.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"165 ","pages":"Article 104818"},"PeriodicalIF":4.0,"publicationDate":"2025-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143753041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Giovanni Maria De Filippis, Domenico Amalfitano, Cristiano Russo, Cristian Tommasino, Antonio Maria Rinaldi
{"title":"A systematic mapping study of semantic technologies in multi-omics data integration","authors":"Giovanni Maria De Filippis, Domenico Amalfitano, Cristiano Russo, Cristian Tommasino, Antonio Maria Rinaldi","doi":"10.1016/j.jbi.2025.104809","DOIUrl":"10.1016/j.jbi.2025.104809","url":null,"abstract":"<div><h3>Objective:</h3><div>The integration of multi-omics data is essential for understanding complex biological systems, providing insights beyond single-omics approaches. However, challenges related to data heterogeneity, standardization, and computational scalability persist. This study explores the interdisciplinary application of semantic technologies to enhance data integration, standardization, and analysis in multi-omics research.</div></div><div><h3>Methods:</h3><div>We performed a systematic mapping study assessing literature from 2014 to 2024, focusing on the utilization of ontologies, knowledge graphs, and graph-based methods for multi-omics integration.</div></div><div><h3>Results:</h3><div>Our findings indicate a growing number of publications in this field, predominantly appearing in high-impact journals. The deployment of semantic technologies has notably improved data visualization, querying, and management, thus enhancing gene and pathway discovery, and providing deeper disease insights and more accurate predictive modeling.</div></div><div><h3>Conclusion:</h3><div>The study underscores the significance of semantic technologies in overcoming multi-omics integration challenges. 
Future research should focus on integrating diverse data types, developing advanced computational tools, and incorporating AI and machine learning to foster personalized medicine applications.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"165 ","pages":"Article 104809"},"PeriodicalIF":4.0,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143738327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ming-Hui Shi , Shao-Wu Zhang , Qing-Qing Zhang , Yong Han , Shanwen Zhang
{"title":"PLAGCA: Predicting protein–ligand binding affinity with the graph cross-attention mechanism","authors":"Ming-Hui Shi , Shao-Wu Zhang , Qing-Qing Zhang , Yong Han , Shanwen Zhang","doi":"10.1016/j.jbi.2025.104816","DOIUrl":"10.1016/j.jbi.2025.104816","url":null,"abstract":"<div><div>Accurate prediction of protein–ligand binding affinity plays a crucial role in drug discovery. However, determining protein–ligand binding affinity through biological experimental approaches is both time-consuming and expensive. Although some computational methods have been developed to predict protein–ligand binding affinity, most existing methods extract the global features of proteins and ligands through separate encoders without extracting the local pocket interaction features of protein–ligand complexes, which limits prediction accuracy. In this work, we proposed a novel Protein–Ligand binding Affinity prediction method (named PLAGCA) by introducing a Graph Cross-Attention mechanism to learn the local three-dimensional (3D) features of protein–ligand pockets, and integrating the global sequence/string features and local graph interaction features of protein–ligand complexes. PLAGCA uses sequence encoding and self-attention to extract the protein/ligand global features from protein FASTA sequences/ligand SMILES strings, and adopts a graph neural network and cross-attention to extract the protein–ligand local interaction features from the molecular structures of protein binding pockets and ligands. All these features are concatenated and input into a multi-layer perceptron (MLP) for predicting the protein–ligand binding affinity. The experimental results show that our PLAGCA outperforms other state-of-the-art computational methods, and it can effectively predict protein–ligand binding affinity with superior generalization capability. 
PLAGCA can capture the critical functional residues that contribute importantly to protein–ligand binding.</div></div>","PeriodicalId":15263,"journal":{"name":"Journal of Biomedical Informatics","volume":"165 ","pages":"Article 104816"},"PeriodicalIF":4.0,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143725274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}