Comparative Evaluation of Deep Learning and Foundation Model Embeddings for Osteoarthritis Feature Classification in Knee Radiographs

Mohammadreza Chavoshi, Hari Trivedi, Janice Newsome, Aawez Mansuri, Frank Li, Theo Dapamede, Bardia Khosravi, Judy Gichoya

Journal of Imaging Informatics in Medicine, published online 2025-09-02. DOI: 10.1007/s10278-025-01636-x (https://doi.org/10.1007/s10278-025-01636-x)
Abstract
Foundation models (FMs) offer a promising alternative to supervised deep learning (DL) by enabling greater flexibility and generalizability without relying on large, labeled datasets. This study compares supervised DL models with pre-trained FM embeddings for classifying radiographic features of knee osteoarthritis. We analyzed 44,985 knee radiographs from the Osteoarthritis Initiative dataset. Two convolutional neural network models (ResNet18 and ConvNeXt-Small) were trained to classify osteophytes, joint space narrowing, subchondral sclerosis, and Kellgren-Lawrence grades (KLG). These models were compared against two FMs: BiomedCLIP, a multimodal vision-language model pre-trained on diverse medical images and text, and RAD-DINO, a vision transformer pre-trained exclusively on chest radiographs. We extracted image embeddings from both FMs and trained XGBoost classifiers for downstream classification. Performance was assessed with a comprehensive set of metrics appropriate for binary and multi-class classification tasks. The DL models outperformed the FM-based approaches across all tasks. ConvNeXt achieved the highest performance in predicting KLG, with a weighted Cohen's kappa of 0.880, and higher AUCs in the binary tasks. BiomedCLIP and RAD-DINO performed similarly; BiomedCLIP's prior exposure to knee radiographs during pretraining yielded only slight improvements. Zero-shot classification with BiomedCLIP correctly identified 91.14% of knee radiographs, with most failures associated with low image quality. Grad-CAM visualizations showed that the DL models, particularly ConvNeXt, reliably focused on clinically relevant regions. While FMs offer promising utility in auxiliary imaging tasks, supervised DL remains superior for fine-grained radiographic feature classification in domains with limited pretraining representation, such as musculoskeletal imaging.
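To make the foundation-model branch of the pipeline concrete, the sketch below shows one plausible implementation of the embedding-plus-XGBoost approach described in the abstract: frozen BiomedCLIP image embeddings used as features for a gradient-boosted classifier of Kellgren-Lawrence grades. This is a minimal illustration, not the authors' code; the Hugging Face checkpoint identifier, the helper names (embed_radiographs, evaluate_klg), the XGBoost hyperparameters, and the quadratic weighting of Cohen's kappa are all assumptions made for the example.

```python
# Hypothetical sketch of the FM-embedding pipeline: frozen BiomedCLIP image
# embeddings fed to an XGBoost classifier for Kellgren-Lawrence grading.
import numpy as np
import torch
from PIL import Image
from open_clip import create_model_from_pretrained
from sklearn.metrics import cohen_kappa_score
from xgboost import XGBClassifier

# Public BiomedCLIP checkpoint on the Hugging Face hub (identifier assumed).
CHECKPOINT = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, preprocess = create_model_from_pretrained(CHECKPOINT)
model.eval()


@torch.no_grad()
def embed_radiographs(paths, batch_size=32):
    """Return one frozen BiomedCLIP image embedding per radiograph file path."""
    feats = []
    for i in range(0, len(paths), batch_size):
        batch = torch.stack(
            [preprocess(Image.open(p).convert("RGB")) for p in paths[i:i + batch_size]]
        )
        feats.append(model.encode_image(batch).cpu().numpy())
    return np.concatenate(feats, axis=0)


def evaluate_klg(train_paths, train_klg, test_paths, test_klg):
    """Fit an XGBoost classifier on FM embeddings and report weighted kappa.

    train_klg / test_klg are Kellgren-Lawrence grades (integers 0-4).
    """
    X_train = embed_radiographs(train_paths)
    X_test = embed_radiographs(test_paths)

    clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1)
    clf.fit(X_train, train_klg)
    pred = clf.predict(X_test)

    # The abstract reports a weighted Cohen's kappa; quadratic weights are a
    # common choice for ordinal KL grades and are assumed here.
    return cohen_kappa_score(test_klg, pred, weights="quadratic")
```

The sketch reflects the contrast the study draws: the foundation model stays frozen and only a lightweight classifier is fit on its embeddings, whereas the supervised DL baselines (ResNet18, ConvNeXt-Small) are trained end to end on the labeled radiographs.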