Subject-level spinal osteoporotic fracture prediction combining deep learning vertebral outputs and limited demographic data
Nathan M. Cross, Jessica Perry, Qifei Dong, Gang Luo, Jonathan Renslo, Brian C. Chang, Nancy E. Lane, Lynn Marshall, Sandra K. Johnston, David R. Haynor, Jeffrey G. Jarvik, Patrick J. Heagerty
DOI: 10.1007/s11657-024-01433-z
Published: 2024-09-10
https://link.springer.com/article/10.1007/s11657-024-01433-z
Abstract
Summary
Automated screening for vertebral fractures could improve outcomes. Using a generalized additive model (GAM) with age and the three highest vertebral body fracture scores from a convolutional neural network, we achieved an AUC-ROC of 0.968 for predicting moderate to severe fracture. Combining individual deep learning vertebral body fracture scores with demographic covariates thus gave excellent subject-level classification of osteoporotic fracture on a large dataset of radiographs with basic demographic data.
Purpose
Osteoporotic vertebral fractures are common and morbid. Automated opportunistic screening for incidental vertebral fractures from radiographs, the highest-volume imaging modality, could improve osteoporosis detection and management. We consider how to form and summarize patient-level fracture predictions to guide management, using our previously developed vertebral fracture classifier on segmented radiographs from a prospective cohort study of US men (MrOS). We compare the performance of logistic regression (LR) and generalized additive models (GAM) with combinations of individual vertebral scores and basic demographic covariates.
Methods
Subject-level LR and GAM models were created retrospectively using all fracture predictions or summary variables such as order statistics, adjacent vertebral interactions, and demographic covariates (age, race/ethnicity). The classifier outputs for 8663 vertebrae from 1176 thoracic and lumbar radiographs in 669 subjects were divided by subject to perform stratified fivefold cross-validation. Models were assessed using multiple metrics, including receiver operating characteristic (ROC) and precision-recall (PR) curves.
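Dividing the data "by subject" means that all vertebrae from one person land in the same fold, so no subject contributes to both training and testing. The following is a minimal sketch of that idea, not the authors' code. The function names, the round-robin assignment, and the row layout (`{"subject": ..., "score": ...}`) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_subject_folds(subject_labels, n_folds=5, seed=0):
    """Assign each subject to one fold, balancing labels across folds.

    subject_labels: dict mapping subject id -> 0/1 subject-level label
                    (hypothetical layout, assumed for this sketch).
    Returns a dict mapping subject id -> fold index in range(n_folds).
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for subject, label in sorted(subject_labels.items()):
        by_label[label].append(subject)

    folds = {}
    offset = 0
    for label in sorted(by_label):
        subjects = by_label[label]
        rng.shuffle(subjects)
        # Deal subjects round-robin so every fold gets a similar label mix.
        for i, subject in enumerate(subjects):
            folds[subject] = (i + offset) % n_folds
        offset += len(subjects)
    return folds


def split_vertebrae(vertebra_rows, folds, test_fold):
    """Split vertebra-level rows according to the fold of their subject."""
    train = [r for r in vertebra_rows if folds[r["subject"]] != test_fold]
    test = [r for r in vertebra_rows if folds[r["subject"]] == test_fold]
    return train, test
```

Assigning folds at the subject level before splitting the vertebra rows is what prevents leakage between training and test sets when one subject has several radiographs.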
Results
The best model (AUC-ROC = 0.968) was a GAM using the top three maximum vertebral fracture scores and age. Using top-ranked scores only, rather than all vertebral scores, improved performance for both model classes. Adding age, but not ethnicity, to the GAMs improved performance slightly.
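As an illustration of the "top three maximum scores and age" input, the sketch below turns a subject's per-vertebra classifier scores into a fixed-length feature vector. The function names and the choice to pad with 0.0 when a subject has fewer than three scored vertebrae are assumptions, not details taken from the paper. Fitting the GAM or LR on these features is omitted.

```python
def top_k_scores(scores, k=3, pad=0.0):
    """Return the k largest fracture scores, highest first.

    If fewer than k scores are available, pad with `pad`
    (an assumption made for this sketch).
    """
    ranked = sorted(scores, reverse=True)[:k]
    return ranked + [pad] * (k - len(ranked))


def subject_features(vertebra_scores, age, k=3):
    """Build one subject-level feature vector: top-k scores, then age."""
    return top_k_scores(vertebra_scores, k) + [float(age)]


# Example: five scored vertebrae for a 72-year-old subject.
features = subject_features([0.1, 0.9, 0.3, 0.05, 0.7], age=72)
# features == [0.9, 0.7, 0.3, 72.0]
```

Order statistics such as these give every subject the same number of inputs regardless of how many vertebrae were visible on the radiograph.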
Conclusion
Maximal vertebral fracture scores resulted in the highest-performing models. While combining multiple vertebral body predictions risks decreasing specificity, our results demonstrate that subject-level models maintain good predictive performance. Thresholding strategies can be used to control sensitivity and specificity as clinically appropriate.
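One possible thresholding strategy is to pick the highest cutoff on the subject-level predicted probability that still reaches a target sensitivity, then report the specificity obtained at that cutoff. The sketch below illustrates this under stated assumptions. The function names and the rule "predict fracture when score >= threshold" are hypothetical and not taken from the paper.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when predicting positive if score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


def threshold_for_sensitivity(labels, scores, target_sensitivity):
    """Return (threshold, sensitivity, specificity) for the highest
    observed score cutoff whose sensitivity meets the target."""
    for t in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(labels, scores, t)
        if sens >= target_sensitivity:
            return t, sens, spec
    raise ValueError("target sensitivity cannot be reached")
```

Lowering the target sensitivity moves the cutoff up and generally raises specificity, which is the trade-off a clinical deployment would have to settle.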