Prediction of Breast Cancer Remission
Vladimir Cardenas, Yalin Li, Samika Shrestha, Hong Xue
Quality Management in Health Care. Published 2025-04-01. DOI: 10.1097/QMH.0000000000000513
Abstract
Background and objectives: This study uses electronic health records (EHR) and social determinants of health (SDOH) data to predict breast cancer remission. The emphasis is on using easily accessible information to improve predictive models, enable early detection of high-risk patients, and support targeted interventions and personalized care strategies.
Methods: This study identifies individuals who are unlikely to respond to standard treatment of breast cancer. We identified 1621 patients with breast cancer by selecting patients who received tamoxifen in the All of Us Research Database. The dependent variable, remission, was defined using tamoxifen exposure as a proxy. Data preprocessing involved creating dummy variables for diseases and for demographic and socioeconomic factors, and handling missing values to maintain data integrity. For feature selection, we first applied the strong rule for feature elimination and then logistic least absolute shrinkage and selection operator (LASSO) regression with 5-fold cross-validation, retaining only predictors whose coefficients had an absolute value greater than 0.01. We then trained machine learning models using logistic regression, random forest, naïve Bayes, and extreme gradient boosting, scoring model performance with the area under the receiver operating characteristic curve (AUROC). This yielded race-neutral model performance. Finally, we analyzed model performance in race and ethnicity test populations including Non-Hispanic White, Non-Hispanic Black, Hispanic, and Other Race or Ethnicity. This yielded race-specific model performance.
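The coefficient-threshold step of the feature selection can be illustrated with a minimal sketch. It is not the authors' code: it fits an L1-penalized (LASSO) logistic regression by plain proximal gradient descent on synthetic dummy variables, skips the strong rule, and uses a fixed penalty instead of one chosen by 5-fold cross-validation. The function names, feature names, penalty value, and data are all hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def soft_threshold(value, amount):
    # Proximal operator of the L1 penalty: shrink toward zero by `amount`.
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def fit_l1_logistic(X, y, penalty, step=0.5, n_iter=500):
    """Fit L1-penalized logistic regression by proximal gradient descent."""
    n, p = len(X), len(X[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, label in zip(X, y):
            linear = intercept + sum(w * x for w, x in zip(weights, row))
            error = sigmoid(linear) - label
            grad_b += error / n
            for j in range(p):
                grad_w[j] += error * row[j] / n
        intercept -= step * grad_b  # the intercept is not penalized
        weights = [soft_threshold(w - step * g, step * penalty)
                   for w, g in zip(weights, grad_w)]
    return weights, intercept

def select_features(names, weights, threshold=0.01):
    # Retain predictors whose coefficient exceeds the threshold in absolute value.
    return [name for name, w in zip(names, weights) if abs(w) > threshold]

# Synthetic dummy variables (hypothetical): only the first two drive the outcome.
rng = random.Random(0)
names = ["feature_a", "feature_b", "noise_c", "noise_d"]
X, y = [], []
for _ in range(400):
    row = [rng.randint(0, 1) for _ in names]
    logit = -1.0 + 2.0 * row[0] - 2.0 * row[1]
    X.append(row)
    y.append(1 if rng.random() < sigmoid(logit) else 0)

# Assumption: a fixed penalty; the study selected it by 5-fold cross-validation.
weights, intercept = fit_l1_logistic(X, y, penalty=0.02)
selected = select_features(names, weights)
print(selected)
```

In practice a library routine with cross-validated penalty selection would replace the hand-written loop. The sketch only shows how the L1 penalty shrinks coefficients toward zero and how the 0.01 cutoff then decides which predictors are retained.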
Results: The models achieved AUROC values between 0.68 and 0.75, with logistic regression and random forest trained on data without interaction terms performing best. Feature selection identified significant factors such as melanocytic nevus and bone disorders, highlighting their importance for predictive accuracy. Race-specific model performance was lower than race-neutral model performance for the Non-Hispanic Black and Other Race or Ethnicity groups.
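The contrast between race-neutral and race-specific performance can be made concrete with a small sketch that computes AUROC once on the full test set and once within each subgroup. This is an illustration under stated assumptions, not the study's evaluation code: the labels, scores, and group names "group_1" and "group_2" are made-up toy values, and AUROC is computed directly from its pairwise definition.

```python
from collections import defaultdict

def auroc(labels, scores):
    """AUROC as the share of positive-negative pairs ranked correctly (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        return None  # undefined when one class is absent
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def auroc_by_group(labels, scores, groups):
    # Split the test set by subgroup and score each subgroup separately.
    by_group = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        by_group[g][0].append(y)
        by_group[g][1].append(s)
    return {g: auroc(ys, ss) for g, (ys, ss) in by_group.items()}

# Toy test set (hypothetical values, not study data).
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.8, 0.3, 0.5]
groups = ["group_1"] * 4 + ["group_2"] * 4

overall = auroc(labels, scores)                      # pooled ("race-neutral" style) score
per_group = auroc_by_group(labels, scores, groups)   # subgroup-specific scores
print(overall)    # 0.6875
print(per_group)  # {'group_1': 1.0, 'group_2': 0.25}
```

The pooled score can sit well above the score of an individual subgroup, which is why reporting only the pooled AUROC can hide weaker performance in some populations.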
Conclusions: Our research demonstrates the feasibility of predicting breast cancer non-remission using EHR and SDOH data, achieving acceptable performance without complex predictors. Addressing data quality limitations and refining remission indicators can further improve the models' utility for early treatment decisions, fostering improved patient outcomes and support throughout the cancer journey.
About the journal:
Quality Management in Health Care (QMHC) is a peer-reviewed journal that provides a forum for our readers to explore the theoretical, technical, and strategic elements of health care quality management. The journal's primary focus is on organizational structure and processes as these affect the quality of care and patient outcomes. In particular, it:
-Builds knowledge about the application of statistical tools, control charts, benchmarking, and other devices used in the ongoing monitoring and evaluation of care and of patient outcomes;
-Encourages research in and evaluation of the results of various organizational strategies designed to bring about quantifiable improvements in patient outcomes;
-Fosters the application of quality management science to patient care processes and clinical decision-making;
-Fosters cooperation and communication among health care providers, payers and regulators in their efforts to improve the quality of patient outcomes;
-Explores links among the various clinical, technical, administrative, and managerial disciplines involved in patient care, as well as the role and responsibilities of organizational governance in ongoing quality management.