Vision-language foundation model for generalizable nasal disease diagnosis using unlabeled endoscopic records

Xueli Liu, Wentao Gong, Xiao Chen, Zhen Li, Yinlong Liu, Li Wang, Quan Liu, Xicai Sun, Xiaofeng Liu, Xinrong Chen, Yuxuan Shi, Hongmeng Yu

Pattern Recognition, Volume 165, Article 111646. DOI: 10.1016/j.patcog.2025.111646. Published 2025-04-04.
Medical artificial intelligence (AI) holds significant potential for identifying signs of health conditions in nasal endoscopic images, thereby accelerating the diagnosis of diseases and systemic disorders. However, the performance of AI models heavily relies on expert annotations, and these models are usually task-specific, with limited generalization across clinical applications. In this paper, we introduce NasVLM, a Nasal Vision-Language foundation Model designed to extract universal representations from unlabeled nasal endoscopic data. Additionally, we construct a large-scale nasal endoscopic pre-training dataset and three downstream validation datasets from routine diagnostic records. The core strength of NasVLM lies in its ability to learn cross-modal semantic representations and perform multi-granular report-image alignment without depending on expert annotations. Furthermore, to the best of our knowledge, it is the first medical foundation model that effectively aligns medical reports with multiple images of different anatomic regions, facilitated by a well-designed hierarchical report-supervised learning framework. The experimental results demonstrate that NasVLM has superior generalization performance across diverse diagnostic tasks and surpasses state-of-the-art self- and report-supervised methods in disease classification and lesion localization, especially in scenarios requiring label-efficient fine-tuning. For instance, NasVLM can distinguish normal nasopharynx (NOR) from abnormalities (benign hyperplasia, BH, and nasopharyngeal carcinoma, NPC) with an accuracy of 91.38% (95% CI, 90.59 to 92.17) and differentiate NPC from BH and NOR with an accuracy of 81.45% (95% CI, 80.21 to 82.67) on the multi-center NPC-Screen dataset using only 1% labeled data, on par with the performance of traditional supervised methods using 100% labeled data.
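The abstract describes report-supervised pre-training in which one free-text report is aligned with several endoscopic images covering different anatomic regions of the same exam. The sketch below is only an illustration of how such an exam-level, multi-image contrastive alignment objective could look in PyTorch; the encoders, the mean pooling over regions, the symmetric InfoNCE loss, and the class name `MultiImageReportAligner` are illustrative assumptions, not details taken from the NasVLM paper (which additionally performs multi-granular alignment).

```python
# Hypothetical sketch (not the authors' implementation): align the pooled
# embedding of an exam's endoscopic images with its report embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiImageReportAligner(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 temperature: float = 0.07):
        super().__init__()
        self.image_encoder = image_encoder  # assumed to map (N, C, H, W) -> (N, D)
        self.text_encoder = text_encoder    # assumed to map report tokens -> (B, D)
        self.temperature = temperature

    def forward(self, images: torch.Tensor, report_tokens: torch.Tensor) -> torch.Tensor:
        # images: (B, R, C, H, W), R images per exam covering different anatomic regions
        b, r, c, h, w = images.shape
        img = self.image_encoder(images.reshape(b * r, c, h, w))  # (B*R, D)
        img = img.reshape(b, r, -1).mean(dim=1)                   # pool over regions -> (B, D)
        img = F.normalize(img, dim=-1)
        txt = F.normalize(self.text_encoder(report_tokens), dim=-1)

        logits = img @ txt.t() / self.temperature                 # (B, B) exam-report similarities
        targets = torch.arange(b, device=logits.device)
        # Symmetric InfoNCE: each exam's images should match its own report, and vice versa.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))
```

After pre-training with an objective of this kind, the image encoder could be fine-tuned with a small labeled fraction (e.g. the 1% setting reported for NPC-Screen) for classification or localization heads.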
Journal introduction:
The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.