Identification of relevant features using SEQENS to improve supervised machine learning models predicting AML treatment outcome.

IF 3.3 | Zone 3 (Medicine) | Q2 MEDICAL INFORMATICS

BMC Medical Informatics and Decision Making | Pub Date: 2025-05-01 | DOI: 10.1186/s12911-025-03001-y

Pedro Pons-Suñer, François Signol, Noemi Alvarez, Claudia Sargas, Sara Dorado, Jose Vicente Gil Ortí, Juan A Delgado Sanchis, Marta Llop, Laura Arnal, Rafael Llobet, Juan-Carlos Perez-Cortes, Rosa Ayala, Eva Barragán

BMC Medical Informatics and Decision Making, vol. 25, no. 1, p. 179. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12044950/pdf/

Citations: 0

Abstract



Background and objective: This study has two main objectives. First, to evaluate a feature selection methodology based on SEQENS, an algorithm for identifying relevant variables. Second, to validate machine learning models that predict the risk of complications in patients with acute myeloid leukemia (AML) using data available at diagnosis. Predictions are made at three time points: 90 days, six months, and one year post-diagnosis. These objectives represent fundamental steps toward the development of a tool to assist clinicians in therapeutic decision-making and provide insights into the risk factors associated with AML complications.

Methods: A dataset of 568 patients, including demographic, clinical, genetic (VAF), and cytogenetic information, was created by combining data from Hospital 12 de Octubre (Madrid, Spain) and Instituto de Investigación Sanitaria La Fe (Valencia, Spain). Feature selection based on an enhanced version of SEQENS was conducted for each time point, followed by the comparison of four classifiers (XGBoost, Multi-Layer Perceptron, Logistic Regression and Decision Tree) to assess the impact of feature selection on model performance.
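The comparison described above amounts to training the same classifier twice, once on all features and once on the selected subset. The abstract does not describe how SEQENS works internally, so the sketch below is not SEQENS; it only shows, under that caveat, how a dataset can be restricted to an already selected list of features. The function name `select_columns`, the column `OtherFeature`, and all values are hypothetical and invented for illustration.

```python
def select_columns(rows, selected):
    """Keep only the selected feature names in each patient record."""
    return [{name: row[name] for name in selected} for row in rows]


# Hypothetical toy records; the values are invented, not study data.
patients = [
    {"Age": 71, "TP53": 0.42, "EZH2": 0.0, "OtherFeature": 1},
    {"Age": 45, "TP53": 0.0, "EZH2": 0.15, "OtherFeature": 0},
]

# Three of the features the abstract reports as relevant at all time points.
selected = ["Age", "TP53", "EZH2"]

# The reduced records would be passed to the classifier in place of `patients`.
reduced = select_columns(patients, selected)
```

Using plain dictionaries keeps the example free of dependencies; with a data-frame library the same step would be a single column-indexing operation.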

Results: SEQENS identified different relevant features for each prediction horizon, with Age, TP53, -7/7Q, and EZH2 consistently relevant across all time points. The models were evaluated using 5-fold cross-validation; XGBoost achieved the highest average ROC-AUC scores of 0.81, 0.84, and 0.82 for 90-day, 6-month, and 1-year predictions, respectively. Generally, performance remained stable or improved after applying SEQENS-based feature selection. Evaluation on an external test set of 54 patients yielded ROC-AUC scores of 0.72 (90-day), 0.75 (6-month), and 0.68 (1-year).
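To make the evaluation metric concrete, here is a minimal sketch of how an average ROC-AUC over 5 folds can be computed. It is not the authors' code. Stratifying the folds by class, the helper names, and the `fit_predict` callback (which stands in for training any classifier on the training indices and scoring the held-out indices) are all assumptions, and the toy data is invented.

```python
import random


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds, spreading each class evenly (assumption)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds


def cross_validated_auc(labels, fit_predict, k=5, seed=0):
    """Mean ROC-AUC over k held-out folds.

    fit_predict(train_idx, test_idx) is a hypothetical callback that trains
    on train_idx and returns one score per index in test_idx.
    """
    folds = stratified_folds(labels, k, seed)
    aucs = []
    for f, test_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        scores = fit_predict(train_idx, test_idx)
        aucs.append(roc_auc([labels[i] for i in test_idx], scores))
    return sum(aucs) / len(aucs)


# Invented toy data: one feature that separates the two classes perfectly.
toy_labels = [0] * 10 + [1] * 10
toy_feature = list(range(20))


def toy_fit_predict(train_idx, test_idx):
    # Stand-in "model": ignores training data and scores by the feature itself.
    return [toy_feature[i] for i in test_idx]


toy_mean_auc = cross_validated_auc(toy_labels, toy_fit_predict)
```

The pairwise definition of ROC-AUC is quadratic in the number of samples, which is fine for a cohort of a few hundred patients. A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported values should be read.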

Conclusions: The models achieved performance levels that suggest they could serve as therapeutic decision support tools at different times after diagnosis. The selected variables align with the European LeukemiaNet (ELN) 2022 risk classification, and the SEQENS-based feature selection effectively reduced the feature set while maintaining prediction accuracy.


Source journal

BMC Medical Informatics and Decision Making (Medicine: Medical Informatics)

CiteScore: 7.20
Self-citation rate: 5.70%
Articles published: 297
Review time: 1 month

About the journal: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.