Leveraging Concise Concepts With Probabilistic Modeling for Interpretable Visual Recognition

IF 9.7 | Zone 1 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI:10.1109/TMM.2025.3557677

Yixuan Zhang;Chuanbin Liu;Yizhi Liu;Yifan Gao;Zhiying Lu;Hongtao Xie;Yongdong Zhang

IEEE Transactions on Multimedia, vol. 27, pp. 3117-3131. Journal Article. https://ieeexplore.ieee.org/document/10948345/

Citations: 0

Abstract

Interpretable visual recognition is essential for decision-making in high-stakes situations. Recent work has automated the construction of interpretable models by combining Visual Language Models (VLMs) and Large Language Models (LLMs) with Concept Bottleneck Models (CBMs), which route predictions through a bottleneck layer of human-understandable concepts. However, existing methods suffer from two main problems: (a) the concepts collected from LLMs can be redundant and include task-irrelevant descriptions, yielding an inferior concept space with potential mismatches; and (b) VLMs directly map global, deterministic image embeddings to fine-grained concepts, which makes the mapping ambiguous and its results imprecise. To address these two issues, we propose Concise Concept and Probabilistic Modeling (CCPM), a novel solution for CBMs that achieves superior classification performance through high-quality concepts and a precise mapping strategy. First, we leverage in-context examples as category-related clues to guide the LLM concept generation process. To mitigate redundancy in the concept space, we propose a Relation-Aware Selection (RAS) module that obtains a concise concept set that is discriminative and relevant, based on image-concept and inter-concept relationships. Second, for precise mapping, we employ a Probabilistic Distribution Adapter (PDA) that estimates the inherent ambiguity of the image embeddings of pre-trained VLMs to capture their complex relationships with concepts. Extensive experiments indicate that our model achieves state-of-the-art results, with a 6.18% improvement in classification accuracy on eight mainstream recognition benchmarks, as well as reliable explainability demonstrated through interpretable analysis.
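The abstract does not specify how RAS or PDA are implemented, so the following is only a rough illustration of the two ideas it names, not the authors' method. It assumes cosine similarity as the relationship measure, a greedy relevance-minus-redundancy score for concept selection, and a diagonal Gaussian around each image embedding whose sampled concept similarities are averaged. All function names, parameters, and toy vectors are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_concepts(image_embs, concept_embs, k, redundancy_weight=0.5):
    """Greedily pick k concept indices (assumed scoring rule, for illustration).

    Relevance is the mean image-concept similarity; redundancy is the highest
    similarity to any concept already chosen.
    """
    relevance = [
        sum(cosine(img, c) for img in image_embs) / len(image_embs)
        for c in concept_embs
    ]
    chosen = []
    candidates = list(range(len(concept_embs)))

    def score(i):
        redundancy = max(
            (cosine(concept_embs[i], concept_embs[j]) for j in chosen),
            default=0.0,
        )
        return relevance[i] - redundancy_weight * redundancy

    while candidates and len(chosen) < k:
        best = max(candidates, key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen


def probabilistic_concept_scores(mean, std, concept_embs, n_samples=32, seed=0):
    """Average concept similarity over samples from a diagonal Gaussian.

    `mean` and `std` describe the assumed distribution of one image embedding;
    a zero `std` reduces this to the usual deterministic similarity.
    """
    rng = random.Random(seed)
    totals = [0.0] * len(concept_embs)
    for _ in range(n_samples):
        sample = [rng.gauss(m, s) for m, s in zip(mean, std)]
        for i, c in enumerate(concept_embs):
            totals[i] += cosine(sample, c)
    return [t / n_samples for t in totals]


# Toy data: concepts 0 and 1 are near-duplicates, concept 2 is distinct.
concepts = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
images = [[1.0, 0.0], [0.6, 0.8]]

selected = select_concepts(images, concepts, k=2)
scores = probabilistic_concept_scores([1.0, 0.0], [0.2, 0.2], concepts)
```

In this toy setup the redundancy penalty makes the selection skip one of the two near-duplicate concepts in favor of the distinct one; setting `redundancy_weight=0` keeps both near-duplicates instead.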

Source journal

IEEE Transactions on Multimedia (Engineering and Technology: Telecommunications)

CiteScore: 11.70

Self-citation rate: 11.00%

Articles published: 576

Review time: 5.5 months

About the journal: The IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of the sponsors, ensuring comprehensive coverage of multimedia research.