Yirong Wu, E. Burnside, Jennifer Cox, Jun Fan, Ming Yuan, Jie Yin, P. Peissig, Alexander G. Cobian, D. Page, M. Craven
Title: Breast Cancer Risk Prediction Using Electronic Health Records
Venue: 2017 IEEE International Conference on Healthcare Informatics (ICHI)
DOI: 10.1109/ICHI.2017.62 (https://doi.org/10.1109/ICHI.2017.62)
Published: 2017-08-01
Citations: 7
Abstract
Electronic health records (EHRs) represent an underused data source that has great research and clinical potential. Our goal was to quantify the value of EHRs in breast cancer risk prediction. We conducted a retrospective case-control study, gathering patients' ICD-9 diagnosis codes from an existing EHR data repository. Based on the hierarchical structure of ICD-9 codes, which are composed of 3-5 digits, three levels of data representation were studied: level 0, using only the first 3 digits; level 1, using up to the first 4 digits; and level 2, using up to the full 5 digits of each code. Using the diagnosis codes at each of the three levels, we built two models to predict breast cancer one year in advance: logistic regression (LR) and LASSO logistic regression (LR+Lasso). Area under the ROC curve (AUC) was used to assess model performance. The LR+Lasso model demonstrated significantly higher predictive performance than the LR model when using the level 2 feature representation (AUC 0.648 vs 0.603, p=0.013). For the level 1 and level 0 representations, the difference between the LR+Lasso and LR models was not significant (0.634 vs 0.604, p=0.081, and 0.612 vs 0.603, p=0.523, respectively). For the LR model, predictive performance changed only modestly across the three levels. For the LR+Lasso model, predictive performance also changed only modestly from the level 0 to the level 1 representation (p=0.168) and from the level 1 to the level 2 representation (p=0.374). However, the level 2 representation provided significantly higher predictive performance than the level 0 representation (p=0.034). The unabridged level 2 representation of the diagnosis codes contains the most valuable information that may contribute to breast cancer risk prediction. The performance of these models demonstrates that EHR data can be used to predict breast cancer risk, which opens the possibility of personalizing care in clinical practice. 
In the future, we will combine coded EHR data with demographic risk factors, genetic variants, and imaging features to improve breast cancer risk prediction.
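The three levels of data representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `truncate_icd9` and `patient_features` and the example codes are hypothetical, and it assumes codes are written with an optional decimal point after the third digit (as in "250.01"). Codes with a letter prefix, such as E codes, would need separate handling that is not shown here.

```python
def truncate_icd9(code: str, level: int) -> str:
    """Truncate an ICD-9 code to its first 3 + level digits.

    level 0 keeps the first 3 digits, level 1 up to 4, level 2 up to 5.
    Assumption: the decimal point is only a separator and can be dropped.
    """
    if level not in (0, 1, 2):
        raise ValueError("level must be 0, 1, or 2")
    digits = code.replace(".", "").strip()
    # Slicing never fails on short codes, so a 4-digit code stays 4 digits at level 2.
    return digits[: 3 + level]


def patient_features(codes, level):
    """Return the sorted distinct truncated codes for one patient."""
    return sorted({truncate_icd9(c, level) for c in codes})


# Illustrative codes only; they are not taken from the study data.
example_codes = ["250.01", "250.02", "401.9", "610.1"]

for level in (0, 1, 2):
    print(level, patient_features(example_codes, level))
```

At level 0 the two codes starting with 250 collapse into a single feature, while at level 2 they stay distinct. This shows how the coarser representations discard detail that the full 5-digit representation keeps.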