Development of an ensemble prediction model for acute graft-versus-host disease in allogeneic transplantation based on machine learning
Authors: Lin Song, Xingwei Wu, Mengjia Xu, Ling Xue, Xun Yu, Zongqi Cheng, Chenrong Huang, Liyan Miao
Journal: BMC Medical Informatics and Decision Making, volume 25, issue 1, article 234 (published 2025-07-01)
DOI: 10.1186/s12911-025-03059-8 (https://doi.org/10.1186/s12911-025-03059-8)
Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12219984/pdf/
Impact factor: 3.8; JCR: Q2 (Medical Informatics)
Clinical trial number: Not applicable.
Citation count: 0
Abstract
Background: Acute graft-versus-host disease (aGVHD) is a major post-transplantation complication and one of the most significant causes of non-relapse death. However, the volume and complexity of clinical data make aGVHD difficult to predict. Machine learning (ML), a branch of artificial intelligence, has been introduced in medicine because it can process complex, high-dimensional variables quickly and capture nonlinear relationships. Previous ML models, however, did not consider the effect of immunosuppressant exposure. This study therefore aimed to develop and optimize Cox regression and ML models that predict the risk of aGVHD, with cyclosporin A exposure and common clinical factors included as variables.
Methods: The data were first preprocessed and then randomly split into training and validation sets at an 8:2 ratio. A Cox regression model was constructed on the training set. For the ML models, correlation analysis and recursive feature elimination were used to screen features before model development. Fifteen algorithms were then used to build models, and an ensemble model was built by soft voting over the five best-performing algorithms. The area under the curve (AUC) was the main metric for evaluating model performance in the validation set, and a nomogram and SHAP were used to interpret the variables.
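Soft voting and AUC evaluation can be illustrated with a minimal, pure-Python sketch. It is not the authors' pipeline: the abstract does not name the fifteen algorithms or the five selected, so the three base models and all probabilities below are made-up toy values, and `soft_vote` and `auc` are hypothetical helper names. The sketch also assumes that AUC refers to the area under the ROC curve, computed here as the fraction of positive-negative pairs that are ranked correctly.

```python
from statistics import mean


def soft_vote(prob_by_model):
    """Average the positive-class probability of each patient across models."""
    # zip(*...) groups the i-th patient's probability from every model together.
    return [mean(probs) for probs in zip(*prob_by_model)]


def auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data only (not from the study): 1 = grade II-IV aGVHD, 0 = no event.
y_true = [1, 0, 1, 0, 0, 1]

# Predicted probabilities from three hypothetical base models.
model_probs = [
    [0.80, 0.30, 0.60, 0.20, 0.60, 0.70],
    [0.60, 0.40, 0.50, 0.10, 0.70, 0.90],
    [0.70, 0.20, 0.70, 0.30, 0.65, 0.50],
]

ensemble = soft_vote(model_probs)              # averaged probabilities
labels = [int(p >= 0.5) for p in ensemble]     # 0.5 threshold is an assumption
print("ensemble probabilities:", [round(p, 3) for p in ensemble])
print("predicted labels:", labels)
print("ensemble AUC:", round(auc(y_true, ensemble), 3))
```

Soft voting averages probabilities rather than counting class votes, so a model that is confident about a patient carries more weight than one that is near 0.5. AUC is computed from the averaged probabilities and therefore does not depend on the classification threshold, whereas accuracy, precision, and recall do.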
Results: A total of 479 patients and 47 variables were included in the study. The incidence of grade II-IV aGVHD was 33.61%. The Cox regression model achieved an AUC of 0.625 in the validation set. By comparison, the new ensemble model predicted better (AUC = 0.776, accuracy = 0.729, precision = 0.667, recall = 0.375, F1-score = 0.480). In addition to variables identified in previous studies, some rarely reported risk factors were found, such as quinolone, blood urea nitrogen, and alkaline phosphatase.
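The reported F1-score can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below uses only the three values stated in the abstract; `f1_score` is a hypothetical helper written for this check, not a library function.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the ensemble model in the validation set.
reported_precision = 0.667
reported_recall = 0.375
reported_f1 = 0.480

computed_f1 = f1_score(reported_precision, reported_recall)
print(f"computed F1 = {computed_f1:.3f}, reported F1 = {reported_f1:.3f}")
```

The computed value rounds to 0.480, which matches the reported figure. A recall of 0.375 means that, at the decision threshold used (not stated in the abstract), the model flagged 37.5% of the validation-set patients who developed grade II-IV aGVHD.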
Conclusions: A new ensemble model with promising accuracy was established to predict grade II-IV classic aGVHD in allo-HSCT patients. It will help identify high-risk patients at an early stage and thus reduce the incidence of aGVHD.
About the journal:
BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.