Prediction of Breast Cancer Remission
Vladimir Cardenas, Yalin Li, Samika Shrestha, Hong Xue
Quality Management in Health Care, pp. 173-180. Published 2025-04-01. DOI: 10.1097/QMH.0000000000000513 (https://doi.org/10.1097/QMH.0000000000000513)
Abstract
Background and objectives: This study uses electronic health records (EHR) and social determinants of health (SDOH) data to predict breast cancer remission. The emphasis is on using easily accessible information to improve predictive models, enable early detection of high-risk patients, and support targeted interventions and personalized care strategies.
Methods: This study identifies individuals who are unlikely to respond to standard treatment of breast cancer. The study identified 1621 patients with breast cancer by selecting patients who received tamoxifen in the All of Us Research Database. The dependent variable, remission, was defined using tamoxifen exposure as a proxy. Data preprocessing involved creating dummy variables for diseases and for demographic and socioeconomic factors, and handling missing values to maintain data integrity. For feature selection, we applied the strong rule for feature elimination, followed by logistic least absolute shrinkage and selection operator (LASSO) regression with 5-fold cross-validation, retaining only predictors whose coefficients had an absolute value greater than 0.01. We then trained machine learning models using logistic regression, random forest, naïve Bayes, and extreme gradient boosting, and scored model performance with the area under the receiver operating characteristic curve (AUROC). This yielded race-neutral model performance. Finally, we analyzed model performance in race and ethnicity test populations, including Non-Hispanic White, Non-Hispanic Black, Hispanic, and Other Race or Ethnicity. This yielded race-specific model performance.
Results: The models achieved AUROC values between 0.68 and 0.75, with logistic regression and random forest trained on data without interaction terms performing best. Feature selection identified significant factors such as melanocytic nevus and bone disorders, highlighting their importance for predictive accuracy. Race-specific model performance was lower than race-neutral model performance for the Non-Hispanic Black and Other Race or Ethnicity groups.
Conclusions: Our research demonstrates the feasibility of predicting breast cancer non-remission using EHR and SDOH data, achieving acceptable performance without complex predictors. Addressing data quality limitations and refining remission indicators can further improve the models' utility for early treatment decisions, fostering improved patient outcomes and support throughout the cancer journey.
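The scoring step described in the Methods (an overall, race-neutral AUROC followed by AUROC within each race and ethnicity test population) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy labels and scores, and the group names `group_a` and `group_b` are all hypothetical, and the AUROC is computed with the pairwise (Mann-Whitney) definition rather than with any particular library.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auroc_by_group(labels, scores, groups):
    """Compute AUROC separately within each subgroup of the test set."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: auroc(ys, ss) for g, (ys, ss) in buckets.items()}


# Hypothetical toy test set: 1 = outcome of interest, 0 = otherwise.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.6, 0.4, 0.5, 0.8, 0.1]
groups = ["group_a"] * 4 + ["group_b"] * 4

overall = auroc(labels, scores)                      # pooled test set
per_group = auroc_by_group(labels, scores, groups)   # one value per subgroup

print(overall)    # 0.875
print(per_group)  # {'group_a': 1.0, 'group_b': 0.75}
```

In this toy data the pooled AUROC (0.875) sits between the two subgroup values, which shows why a single pooled figure can hide weaker performance in one subgroup. Small subgroups with few positive or negative cases give unstable estimates, so subgroup AUROC values are best read alongside the subgroup sizes.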
About the journal:
Quality Management in Health Care (QMHC) is a peer-reviewed journal that provides a forum for our readers to explore the theoretical, technical, and strategic elements of health care quality management. The journal's primary focus is on organizational structure and processes as these affect the quality of care and patient outcomes. In particular, it:
-Builds knowledge about the application of statistical tools, control charts, benchmarking, and other devices used in the ongoing monitoring and evaluation of care and of patient outcomes;
-Encourages research in and evaluation of the results of various organizational strategies designed to bring about quantifiable improvements in patient outcomes;
-Fosters the application of quality management science to patient care processes and clinical decision-making;
-Fosters cooperation and communication among health care providers, payers and regulators in their efforts to improve the quality of patient outcomes;
-Explores links among the various clinical, technical, administrative, and managerial disciplines involved in patient care, as well as the role and responsibilities of organizational governance in ongoing quality management.