Vision-language foundation model for generalizable nasal disease diagnosis using unlabeled endoscopic records

Impact factor 7.5 · CAS Zone 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)
Xueli Liu, Wentao Gong, Xiao Chen, Zhen Li, Yinlong Liu, Li Wang, Quan Liu, Xicai Sun, Xiaofeng Liu, Xinrong Chen, Yuxuan Shi, Hongmeng Yu
Pattern Recognition, vol. 165, Article 111646. Published 2025-04-04. DOI: 10.1016/j.patcog.2025.111646
https://www.sciencedirect.com/science/article/pii/S0031320325003061
Citations: 0

Abstract

Medical artificial intelligence (AI) holds significant potential for identifying signs of health conditions in nasal endoscopic images, thereby accelerating the diagnosis of diseases and systemic disorders. However, the performance of AI models relies heavily on expert annotations, and these models are usually task-specific, with limited generalization across clinical applications. In this paper, we introduce NasVLM, a Nasal Vision-Language foundation Model designed to extract universal representations from unlabeled nasal endoscopic data. Additionally, we construct a large-scale nasal endoscopic pre-training dataset and three downstream validation datasets from routine diagnostic records. The core strength of NasVLM lies in its ability to learn cross-modal semantic representations and perform multi-granular report-image alignment without depending on expert annotations. Furthermore, to the best of our knowledge, it is the first medical foundation model that effectively aligns a medical report with multiple images of different anatomic regions, facilitated by a hierarchical report-supervised learning framework. The experimental results demonstrate that NasVLM generalizes better across diverse diagnostic tasks and surpasses state-of-the-art self-supervised and report-supervised methods in disease classification and lesion localization, especially in scenarios requiring label-efficient fine-tuning. For instance, on the multi-center NPC-Screen dataset using only 1% labeled data, NasVLM distinguishes normal nasopharynx (NOR) from abnormalities (benign hyperplasia, BH, and nasopharyngeal carcinoma, NPC) with an accuracy of 91.38% (95% CI, 90.59 to 92.17) and differentiates NPC from BH and NOR with an accuracy of 81.45% (95% CI, 80.21 to 82.67), on par with traditional supervised methods using 100% labeled data.
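To make the idea of aligning one report with several images more concrete, the following is a minimal sketch of a generic contrastive (InfoNCE-style) objective. It is not the NasVLM framework, whose hierarchical design the abstract does not detail. The function name `report_image_infonce`, the mean pooling over images, the temperature of 0.07, and the random toy embeddings are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def report_image_infonce(image_emb, report_emb, temperature=0.07):
    """Symmetric InfoNCE loss between studies and their reports.

    image_emb:  (B, K, D) array, K image embeddings per study (hypothetical encoder output).
    report_emb: (B, D) array, one report embedding per study.
    Assumption: the K images of a study are mean-pooled into one study embedding.
    """
    study_emb = l2_normalize(image_emb.mean(axis=1))   # (B, D)
    report_emb = l2_normalize(report_emb)              # (B, D)
    logits = study_emb @ report_emb.T / temperature    # (B, B) similarity matrix
    diag = np.arange(logits.shape[0])                  # matching pairs sit on the diagonal
    loss_image_to_report = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_report_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_image_to_report + loss_report_to_image)


# Toy data: 4 studies, 3 images each, 16-dimensional embeddings.
rng = np.random.default_rng(0)
reports = rng.normal(size=(4, 16))
aligned_images = reports[:, None, :] + 0.1 * rng.normal(size=(4, 3, 16))
random_images = rng.normal(size=(4, 3, 16))

loss_aligned = report_image_infonce(aligned_images, reports)
loss_random = report_image_infonce(random_images, reports)
print(f"aligned: {loss_aligned:.4f}, random: {loss_random:.4f}")
```

In this toy setup, image embeddings that sit close to their own report's embedding should yield a lower loss than unrelated ones. A real system would replace the random arrays with encoder outputs and would likely add finer-grained alignment terms in place of the simple mean pooling assumed here.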
Source journal
Pattern Recognition
Category: Engineering & Technology, Engineering: Electrical & Electronic
CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months
Journal description: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.