M.S. White, K.T. Gao, V. Pedoia, S. Majumdar, G.E. Gold, A.S. Chaudhari, A.A. Gatti
{"title":"TOWARD OPENLY AVAILABLE KNEE MRI SEGMENTATIONS FOR THE OAI: MULTI-MODEL EVALUATION AND CONSENSUS GENERATION ON 9,360 SCANS","authors":"M.S. White , K.T. Gao , V. Pedoia , S. Majumdar , G.E. Gold , A.S. Chaudhari , A.A. Gatti","doi":"10.1016/j.ostima.2025.100330","DOIUrl":null,"url":null,"abstract":"<div><h3>INTRODUCTION</h3><div>Many deep learning methods exist for segmentation of bone and cartilage in knee MRI, but their agreement and impact on quantitative metrics (e.g., cartilage thickness) remain unclear. Prior studies have not investigated whether combining segmentations from independent deep learning models can improve sensitivity to detect clinically relevant differences. Understanding these effects in large cohorts is essential to guide deep learning in OA research and clinical trials.</div></div><div><h3>OBJECTIVE</h3><div>To generate consensus segmentations from independent deep learning models developed at Stanford and UCSF, evaluate agreement between bone and cartilage segmentations across all models, and assess each method’s sensitivity to detect cartilage thickness differences between KL2 and KL3 knees.</div></div><div><h3>METHODS</h3><div>Bone and cartilage segmentations of 9360 knees from the OAI baseline dataset were independently generated in prior work by Stanford and UCSF using separately validated deep learning models. A consensus segmentation was generated using the Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm, with the threshold tuned to minimize cartilage volume differences between the two models. Segmentations were compared using volume differences (%), Dice Similarity Coefficient (DSC), and average symmetric surface distance (ASSD). Mean cartilage thickness was computed in sub-regions (femur: anterior, medial/lateral weight-bearing, posterior; tibia: medial and lateral, and patella) and compared using Pearson correlations and intraclass correlation coefficients (ICC). 
Each method’s (UCSF, Stanford, and STAPLE’s) sensitivity to detect between group (KL2 and KL3) differences in cartilage thickness was assed using effect sizes (Cohen’s d).</div></div><div><h3>RESULTS</h3><div>Comparing Stanford and UCSF models, bone demonstrated better overlap (DSC = 0.95-0.97) compared to cartilage (DSC = 0.79-0.82). However, cartilage had smaller volume differences (-0.2-1.9% vs. 2.5-6.2%) and lower ASSD (0.24-0.33 mm vs. 0.33-0.47 mm) relative to bone. Both Stanford vs. STAPLE and UCSF vs. STAPLE yielded better segmentation agreement (higher DSC, lower ASSD) compared to Stanford vs. UCSF, despite larger volume differences (Table 1A). Compared to one another, Stanford and UCSF cartilage thickness measurements had high correlation (r = 0.96-0.99) and agreement (ICC = 0.96-0.99, mean differences < 0.04 mm). STAPLE produced systematically greater thickness values (mean difference = 0.16 ± 0.08 mm), and slightly lower ICCs (ICC = 0.88-0.96), and correlations (r = 0.92-.97) when compared with Stanford or UCSF. Effect sizes for mean cartilage thickness between KL2 and KL3 knees were small (Cohen’s d < 0.5), except for the medial weight-bearing femur, which had moderate effects for Stanford (-0.60) and UCSF (-0.58), and small-to-moderate for STAPLE (-0.48; Table 1B).</div></div><div><h3>CONCLUSION</h3><div>Cartilage thickness measurements were highly correlated across methods and regions, indicating preservation of key quantitative information, despite lower DSC between Stanford and UCSF and lower absolute agreement in thickness (ICC) between STAPLE and each method. Importantly, STAPLE slightly reduced sensitivity to detect changes in medial weight-bearing femoral cartilage. Leveraging the many other existing OAI DESS segmentation models has the potential to further improve the consensus. 
Future work will refine the consensus segmentations, scale analyses to the full OAI dataset, and open source the resulting consensus segmentation masks.</div></div>","PeriodicalId":74378,"journal":{"name":"Osteoarthritis imaging","volume":"5 ","pages":"Article 100330"},"PeriodicalIF":0.0000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Osteoarthritis imaging","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772654125000704","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
INTRODUCTION
Many deep learning methods exist for segmentation of bone and cartilage in knee MRI, but their agreement and impact on quantitative metrics (e.g., cartilage thickness) remain unclear. Prior studies have not investigated whether combining segmentations from independent deep learning models can improve sensitivity to detect clinically relevant differences. Understanding these effects in large cohorts is essential to guide deep learning in OA research and clinical trials.
OBJECTIVE
To generate consensus segmentations from independent deep learning models developed at Stanford and UCSF, evaluate agreement between bone and cartilage segmentations across all models, and assess each method’s sensitivity to detect cartilage thickness differences between KL2 and KL3 knees.
METHODS
Bone and cartilage segmentations of 9,360 knees from the OAI baseline dataset were independently generated in prior work by Stanford and UCSF using separately validated deep learning models. A consensus segmentation was generated using the Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm, with the threshold tuned to minimize cartilage volume differences between the two models. Segmentations were compared using volume differences (%), Dice Similarity Coefficient (DSC), and average symmetric surface distance (ASSD). Mean cartilage thickness was computed in sub-regions (femur: anterior, medial/lateral weight-bearing, posterior; tibia: medial and lateral; patella) and compared using Pearson correlations and intraclass correlation coefficients (ICC). The sensitivity of each method (UCSF, Stanford, and STAPLE) to detect between-group (KL2 vs. KL3) differences in cartilage thickness was assessed using effect sizes (Cohen’s d).
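As an illustration of two of the comparison metrics named above, the following is a minimal sketch of DSC and percent volume difference on flattened binary masks. The function names, the toy masks, and the choice to normalize the volume difference by the second mask are assumptions made for this example; the abstract does not state which normalization or implementation the authors used.

```python
from typing import Sequence


def dice(a: Sequence[int], b: Sequence[int]) -> float:
    """Dice Similarity Coefficient between two flattened binary masks."""
    if len(a) != len(b):
        raise ValueError("masks must contain the same number of voxels")
    overlap = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(1 for x in a if x) + sum(1 for y in b if y)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * overlap / total


def volume_difference_pct(a: Sequence[int], b: Sequence[int]) -> float:
    """Signed volume difference of mask a relative to mask b, in percent.

    Normalizing by mask b is an assumption of this sketch.
    """
    vol_a = sum(1 for x in a if x)
    vol_b = sum(1 for y in b if y)
    if vol_b == 0:
        raise ValueError("reference mask is empty")
    return 100.0 * (vol_a - vol_b) / vol_b


# Synthetic toy masks (not OAI data): 4 and 5 foreground voxels, 3 shared.
mask_a = [1, 1, 1, 1, 0, 0, 0, 0]
mask_b = [0, 1, 1, 1, 1, 1, 0, 0]

dsc = dice(mask_a, mask_b)
vol_diff = volume_difference_pct(mask_a, mask_b)
print(f"DSC = {dsc:.3f}, volume difference = {vol_diff:.1f}%")
```

The toy masks show why the two metrics can diverge: volume difference only counts voxels, while DSC also penalizes voxels that are placed differently.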
RESULTS
Comparing the Stanford and UCSF models, bone showed better overlap (DSC = 0.95-0.97) than cartilage (DSC = 0.79-0.82). However, cartilage had smaller volume differences (-0.2% to 1.9% vs. 2.5% to 6.2%) and lower ASSD (0.24-0.33 mm vs. 0.33-0.47 mm) than bone. Both Stanford vs. STAPLE and UCSF vs. STAPLE yielded better segmentation agreement (higher DSC, lower ASSD) than Stanford vs. UCSF, despite larger volume differences (Table 1A). Stanford and UCSF cartilage thickness measurements were highly correlated (r = 0.96-0.99) and in close agreement (ICC = 0.96-0.99, mean differences < 0.04 mm). Compared with Stanford or UCSF, STAPLE produced systematically greater thickness values (mean difference = 0.16 ± 0.08 mm), with slightly lower ICCs (0.88-0.96) and correlations (r = 0.92-0.97). Effect sizes for mean cartilage thickness between KL2 and KL3 knees were small (|Cohen’s d| < 0.5), except for the medial weight-bearing femur, which had moderate effects for Stanford (-0.60) and UCSF (-0.58), and small-to-moderate for STAPLE (-0.48; Table 1B).
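The effect sizes reported above are Cohen’s d values. A minimal sketch of that computation follows, assuming the common pooled-standard-deviation form and a KL3-minus-KL2 sign convention (so thinner cartilage in KL3 gives a negative d). Both are assumptions, since the abstract does not specify them, and the thickness values are synthetic, not OAI measurements.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d for two independent groups using the pooled standard deviation.

    Returns (mean of group_a - mean of group_b) / pooled SD.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Synthetic mean thickness values in mm (hypothetical, for illustration only).
kl3_thickness = [1.0, 2.0, 3.0]
kl2_thickness = [2.0, 3.0, 4.0]

d = cohens_d(kl3_thickness, kl2_thickness)
print(f"Cohen's d (KL3 - KL2) = {d:.2f}")
```

Because d divides the group difference by the pooled spread, a method that adds measurement noise without shifting the group means will tend to shrink |d|, which is one way a segmentation method can lose sensitivity.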
CONCLUSION
Cartilage thickness measurements were highly correlated across methods and regions, indicating preservation of key quantitative information, despite lower DSC between Stanford and UCSF and lower absolute agreement in thickness (ICC) between STAPLE and each method. Importantly, STAPLE slightly reduced sensitivity to detect KL2 vs. KL3 differences in medial weight-bearing femoral cartilage thickness. Leveraging the many other existing OAI DESS segmentation models has the potential to further improve the consensus. Future work will refine the consensus segmentations, scale analyses to the full OAI dataset, and openly release the resulting consensus segmentation masks.
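The distinction drawn above between high correlation and lower absolute agreement can be illustrated with a small sketch: a constant offset between two methods leaves Pearson r unchanged but lowers an absolute-agreement ICC. The ICC form used here (two-way random effects, absolute agreement, single measurement, often written ICC(2,1)) is an assumption, since the abstract does not state which form was used, and the values are synthetic.

```python
from math import sqrt
from statistics import mean
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between paired measurements."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def icc_absolute_agreement(x: Sequence[float], y: Sequence[float]) -> float:
    """ICC(2,1) for two methods measuring the same n subjects (assumed form)."""
    n, k = len(x), 2
    rows = list(zip(x, y))
    grand = mean(list(x) + list(y))
    # Mean squares from a two-way ANOVA without replication.
    ms_rows = k * sum((mean(r) - grand) ** 2 for r in rows) / (n - 1)
    ms_cols = n * sum((mean(c) - grand) ** 2 for c in (x, y)) / (k - 1)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_err = ss_total - ms_rows * (n - 1) - ms_cols * (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Synthetic values (hypothetical): method_b equals method_a plus a constant offset.
method_a = [1.0, 2.0, 3.0, 4.0]
method_b = [2.0, 3.0, 4.0, 5.0]

r = pearson_r(method_a, method_b)
icc = icc_absolute_agreement(method_a, method_b)
print(f"r = {r:.3f}, ICC = {icc:.3f}")
```

In this toy case the ranking of subjects is perfectly preserved, so r stays at 1, while the ICC drops because the between-method bias counts against absolute agreement.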