TOWARD OPENLY AVAILABLE KNEE MRI SEGMENTATIONS FOR THE OAI: MULTI-MODEL EVALUATION AND CONSENSUS GENERATION ON 9,360 SCANS

Osteoarthritis imaging. Pub Date: 2025-01-01. DOI: 10.1016/j.ostima.2025.100330

M.S. White, K.T. Gao, V. Pedoia, S. Majumdar, G.E. Gold, A.S. Chaudhari, A.A. Gatti

Osteoarthritis imaging, volume 5, Article 100330. https://www.sciencedirect.com/science/article/pii/S2772654125000704


Abstract

INTRODUCTION

Many deep learning methods exist for segmentation of bone and cartilage in knee MRI, but their agreement and impact on quantitative metrics (e.g., cartilage thickness) remain unclear. Prior studies have not investigated whether combining segmentations from independent deep learning models can improve sensitivity to detect clinically relevant differences. Understanding these effects in large cohorts is essential to guide deep learning in OA research and clinical trials.

OBJECTIVE

To generate consensus segmentations from independent deep learning models developed at Stanford and UCSF, evaluate agreement between bone and cartilage segmentations across all models, and assess each method’s sensitivity to detect cartilage thickness differences between KL2 and KL3 knees.

METHODS

Bone and cartilage segmentations of 9360 knees from the OAI baseline dataset were independently generated in prior work by Stanford and UCSF using separately validated deep learning models. A consensus segmentation was generated using the Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm, with the threshold tuned to minimize cartilage volume differences between the two models. Segmentations were compared using volume differences (%), Dice Similarity Coefficient (DSC), and average symmetric surface distance (ASSD). Mean cartilage thickness was computed in sub-regions (femur: anterior, medial/lateral weight-bearing, posterior; tibia: medial and lateral; patella) and compared using Pearson correlations and intraclass correlation coefficients (ICC). The sensitivity of each method (UCSF, Stanford, and STAPLE) to detect between-group (KL2 vs. KL3) differences in cartilage thickness was assessed using effect sizes (Cohen’s d).
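
As an illustration of two of the agreement metrics named above, the following is a minimal sketch of DSC and percent volume difference on binary masks. It is not the authors' pipeline: the function names, the toy masks, and the choice of the second mask as the reference for the percent difference are assumptions, since the abstract does not state which segmentation served as the denominator.

```python
import numpy as np


def dice(mask_a, mask_b):
    """Dice Similarity Coefficient between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / total


def volume_difference_percent(mask_a, mask_b):
    """Signed volume difference of mask_a relative to mask_b, in percent.

    Assumption: mask_b is the reference (denominator).
    """
    vol_a = np.count_nonzero(mask_a)
    vol_b = np.count_nonzero(mask_b)
    return 100.0 * (vol_a - vol_b) / vol_b


# Toy 4x4x4 volumes (hypothetical data, not OAI scans).
mask_a = np.zeros((4, 4, 4), dtype=bool)
mask_b = np.zeros((4, 4, 4), dtype=bool)
mask_a[1:3, 1:3, 1:3] = True  # 8 voxels
mask_b[1:3, 1:3, 1:4] = True  # 12 voxels, fully containing mask_a

print(dice(mask_a, mask_b))
print(volume_difference_percent(mask_a, mask_b))
```

In this toy case the smaller mask sits entirely inside the larger one, so the overlap score is fairly high while the volume difference is large. The two metrics capture different aspects of agreement, which is relevant to the bone versus cartilage contrast reported in the results.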

RESULTS

Comparing Stanford and UCSF models, bone demonstrated better overlap (DSC = 0.95-0.97) than cartilage (DSC = 0.79-0.82). However, cartilage had smaller volume differences (-0.2-1.9% vs. 2.5-6.2%) and lower ASSD (0.24-0.33 mm vs. 0.33-0.47 mm) relative to bone. Both Stanford vs. STAPLE and UCSF vs. STAPLE yielded better segmentation agreement (higher DSC, lower ASSD) than Stanford vs. UCSF, despite larger volume differences (Table 1A). Compared to one another, Stanford and UCSF cartilage thickness measurements had high correlation (r = 0.96-0.99) and agreement (ICC = 0.96-0.99, mean differences < 0.04 mm). STAPLE produced systematically greater thickness values (mean difference = 0.16 ± 0.08 mm) and slightly lower ICCs (ICC = 0.88-0.96) and correlations (r = 0.92-0.97) when compared with Stanford or UCSF. Effect sizes for mean cartilage thickness between KL2 and KL3 knees were small (Cohen’s d < 0.5), except for the medial weight-bearing femur, which had moderate effects for Stanford (-0.60) and UCSF (-0.58), and a small-to-moderate effect for STAPLE (-0.48; Table 1B).
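
For readers less familiar with the effect size used here, the following is a minimal sketch of Cohen's d with a pooled standard deviation. The sign convention (KL3 minus KL2, so that thinner cartilage in KL3 gives a negative value) and the thickness values are assumptions for illustration; the abstract does not specify either.

```python
import numpy as np


def cohens_d(group1, group2):
    """Cohen's d for two independent groups, using the pooled sample SD.

    Returns (mean(group1) - mean(group2)) / pooled_sd.
    """
    x = np.asarray(group1, dtype=float)
    y = np.asarray(group2, dtype=float)
    n1, n2 = x.size, y.size
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)


# Hypothetical mean cartilage thickness values in mm (not study data).
kl3_thickness = [2.0, 2.2, 2.4]
kl2_thickness = [2.2, 2.4, 2.6]

print(cohens_d(kl3_thickness, kl2_thickness))
```

With this convention, a value near -0.6 means the KL3 group mean lies roughly 0.6 pooled standard deviations below the KL2 group mean.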

CONCLUSION

Cartilage thickness measurements were highly correlated across methods and regions, indicating preservation of key quantitative information, despite lower DSC between Stanford and UCSF and lower absolute agreement in thickness (ICC) between STAPLE and each method. Importantly, STAPLE slightly reduced sensitivity to detect differences in medial weight-bearing femoral cartilage thickness between KL2 and KL3 knees. Leveraging the many other existing OAI DESS segmentation models has the potential to further improve the consensus. Future work will refine the consensus segmentations, scale analyses to the full OAI dataset, and open-source the resulting consensus segmentation masks.
