A Practical Approach to Disease Risk Prediction: Focus on High-Risk Patients via Highest-k Loss
Hongyi Yang, Rich Gonzalez, Brahmajee K Nallamothu, Keith D Aaronson, Kevin R Ward, Alfred O Hero, Sardar Ansari
Proceedings. IEEE International Conference on Bioinformatics and Biomedicine, vol. 2023, pp. 3226-3233. Published 2023-12-01 (Epub 2024-01-18).
DOI: https://doi.org/10.1109/bibm58861.2023.10385816
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11821551/pdf/
Abstract
Disease risk prediction models play an important role in preventing disease development in modern healthcare. However, a lack of focus on high-risk patients has hindered their large-scale practical application, especially given the limited medical resources available for following up on patients deemed high-risk. In this study, we propose a novel and practical approach that focuses on minimizing the number of false positives among high-risk patients by introducing the Highest-k Loss. The approach estimates weights for the highest k scores using a differentiable estimate of the sorting operation and applies those weights to the loss function. We extracted 253,680 survey responses from a public dataset of the U.S. health survey system to define a diabetes prediction task. We evaluate the proposed method systematically using nested cross-validation as well as an aggregated model applied to an independent test set. Compared with traditional binary cross-entropy loss and Focal loss, the Highest-k loss improved precision (positive predictive value) for the highest 1% of scores by 0.05 (95% CI: 0.041-0.055), for the highest 5% by 0.03 (95% CI: 0.024-0.032), and for the highest 10% by 0.02 (95% CI: 0.016-0.021). The Highest-k loss function addresses this shortcoming of prevailing risk prediction models and offers a practical solution that focuses on the patients with the k highest predictive scores, who can realistically receive an intervention, rather than on the entire patient population.
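The idea of weighting the loss toward the k highest scores can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not say which differentiable sorting estimate is used, so a pairwise-sigmoid soft rank is assumed here as a stand-in, and the function names (`soft_top_k_weights`, `highest_k_bce`, `precision_at_top`) and temperatures (`tau`, `rank_tau`) are hypothetical. Every step is a smooth function of the scores, so the same arithmetic could be written in an autodiff framework.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_top_k_weights(scores, k, tau=0.05, rank_tau=0.1):
    """Smooth weights: near 1 for the k highest scores, near 0 otherwise.

    Assumption: a pairwise-sigmoid soft rank replaces the paper's
    differentiable sorting estimate, which the abstract does not specify.
    """
    weights = []
    for i, s_i in enumerate(scores):
        # Soft rank = 1 + smoothed count of scores that exceed s_i.
        rank = 1.0 + sum(
            sigmoid((s_j - s_i) / tau)
            for j, s_j in enumerate(scores)
            if j != i
        )
        # Soft indicator of "rank <= k", with the threshold at k + 0.5.
        weights.append(sigmoid((k + 0.5 - rank) / rank_tau))
    return weights


def highest_k_bce(scores, labels, k, tau=0.05, rank_tau=0.1, eps=1e-12):
    """Binary cross-entropy averaged with soft top-k weights.

    `scores` are predicted probabilities in [0, 1]; `labels` are 0 or 1.
    """
    weights = soft_top_k_weights(scores, k, tau, rank_tau)
    losses = [
        -(y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps))
        for p, y in zip(scores, labels)
    ]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


def precision_at_top(scores, labels, fraction):
    """Precision (positive predictive value) among the top `fraction` of scores."""
    n_top = max(1, round(fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:n_top]) / n_top


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.3, 0.2, 0.1]
    labels = [1, 0, 1, 0, 0]
    print(soft_top_k_weights(scores, k=2))
    print(highest_k_bce(scores, labels, k=2))
    print(precision_at_top(scores, labels, fraction=0.4))
```

In this toy example the second-highest score is a false positive, so it dominates the weighted loss, while the low-scoring true positive contributes almost nothing. Smaller temperatures bring the weights closer to a hard top-k selection at the cost of sharper gradients.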