A novel weighted pseudo-labeling framework based on matrix factorization for adverse drug reaction prediction
Junheng Chen, Fangfang Han, Mingxiu He, Yiyang Shi, Yongming Cai
BMC Bioinformatics 26(1):54, published 2025-02-17. DOI: https://doi.org/10.1186/s12859-025-06053-z
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11831795/pdf/
Journal impact factor: 2.9 (JCR Q2, Biochemical Research Methods)
Citations: 0
Abstract
Adverse drug reactions (ADRs) are a global public health problem that seriously endangers human life and imposes a high economic burden. Predicting their occurrence, so that early and effective countermeasures can be taken, is therefore of great significance. One current strategy for predicting ADRs from publicly accessible data is to construct an association matrix between drugs and their adverse reactions and then mine it for associations. Because known ADRs in real-world data are far fewer than unknown ones, the drug-ADR association matrix is very sparse, which greatly degrades the classification performance of machine learning methods. To address this sparsity, we propose a novel weighted pseudo-labeling framework that integrates multiple weighted matrix factorization (MF) models to mine potential unknown drug-ADR pairs and treats them as pseudo-labeled drug-ADR pairs. The pseudo-labeled data are added to the training set, and the MF model is fine-tuned to improve classification performance. To prevent overfitting to easily found pseudo-labels and to improve pseudo-label quality, a novel weighting approach for pseudo-labels is adopted. We reproduce the baselines under the same experimental conditions and evaluate the proposed method on sparse data from the Side Effect Resource (SIDER) database. Experimental results show that our method outperforms the baselines in area under the precision-recall curve (AUPR) and F1-score, and that it maintains the best performance in sparser scenarios. A case study further shows that the proposed framework efficiently predicts real-world ADRs.
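The pipeline described in the abstract (an ensemble of weighted MF models, pseudo-label mining, weighted fine-tuning) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the squared-error loss, the gradient-descent solver, the per-model weights for unknown entries, and the use of the clipped ensemble score as the pseudo-label weight are all assumptions made for illustration, and the function names `weighted_mf`, `mine_pseudo_labels`, and `fine_tune` are hypothetical. The data are a random toy matrix, not SIDER.

```python
import numpy as np

def weighted_mf(R, W, rank=4, lr=0.01, reg=0.01, epochs=200, seed=0, init=None):
    """Fit R ~ U @ V.T by gradient descent on a per-entry weighted squared error."""
    rng = np.random.default_rng(seed)
    if init is None:
        U = 0.1 * rng.standard_normal((R.shape[0], rank))
        V = 0.1 * rng.standard_normal((R.shape[1], rank))
    else:
        U, V = init[0].copy(), init[1].copy()
    for _ in range(epochs):
        E = W * (U @ V.T - R)            # weighted residual
        grad_U = E @ V + reg * U
        grad_V = E.T @ U + reg * V
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V

def mine_pseudo_labels(R, unknown_weights=(0.1, 0.2, 0.3), top_k=5, rank=4):
    """Average several weighted MF models and pick the top-k unknown pairs."""
    models = []
    for seed, w0 in enumerate(unknown_weights):
        # Assumption: known pairs get weight 1, unknown pairs a smaller weight w0.
        W = np.where(R == 1, 1.0, w0)
        models.append(weighted_mf(R, W, rank=rank, seed=seed))
    scores = np.mean([U @ V.T for U, V in models], axis=0)
    scores = np.where(R == 1, -np.inf, scores)   # only unknown pairs are candidates
    flat = np.argsort(scores, axis=None)[::-1][:top_k]
    rows, cols = np.unravel_index(flat, R.shape)
    # Assumption: the pseudo-label weight is the clipped ensemble score.
    conf = np.clip(scores[rows, cols], 0.05, 1.0)
    return rows, cols, conf, models[0]

def fine_tune(R, rows, cols, conf, init, unknown_weight=0.1):
    """Add weighted pseudo-labels to the training matrix and continue training."""
    R_aug = R.astype(float).copy()
    W_aug = np.where(R == 1, 1.0, unknown_weight)
    R_aug[rows, cols] = 1.0
    W_aug[rows, cols] = conf
    return weighted_mf(R_aug, W_aug, rank=init[0].shape[1], epochs=100, init=init)

# Toy sparse drug-ADR matrix (30 drugs x 20 ADRs, roughly 10% known pairs).
rng = np.random.default_rng(42)
R = (rng.random((30, 20)) < 0.1).astype(float)

rows, cols, conf, base_model = mine_pseudo_labels(R)
U, V = fine_tune(R, rows, cols, conf, base_model)
scores = U @ V.T   # predicted association scores for every drug-ADR pair
```

Down-weighting pseudo-labels relative to known pairs is one simple way to keep the fine-tuned model from overfitting to them; the paper's actual weighting scheme may differ.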
About the journal:
BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology.
BMC Bioinformatics is part of the BMC series, which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.