Learning From Human Attention for Attribute-Assisted Visual Recognition

Xiao Bai;Pengcheng Zhang;Xiaohan Yu;Jin Zheng;Edwin R. Hancock;Jun Zhou;Lin Gu
DOI: 10.1109/TPAMI.2024.3458921
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Published: 2024-09-11
URL: https://ieeexplore.ieee.org/document/10678838/
Citations: 0

Abstract

With prior knowledge of seen objects, humans have a remarkable ability to recognize novel objects using shared and distinct local attributes. This is significant for the challenging tasks of zero-shot learning (ZSL) and fine-grained visual classification (FGVC), where the discriminative attributes of objects play an important role. Inspired by human visual attention, neural networks have widely exploited the attention mechanism to learn locally discriminative attributes for challenging tasks. Though they have greatly promoted the development of these fields, existing works mainly focus on learning the region embeddings of different attribute features and neglect the importance of discriminative attribute localization. It is also unclear whether the learned attention truly matches real human attention. To tackle this problem, this paper proposes to employ real human gaze data for visual recognition networks to learn from human attention. Specifically, we design a unified Attribute Attention Network (A$^{2}$Net) that learns from human attention for both ZSL and FGVC tasks. The overall model consists of an attribute attention branch and a baseline classification network. On top of the image feature maps provided by the baseline classification network, the attribute attention branch employs attribute prototypes to produce attribute attention maps and attribute features. The attribute attention maps are converted to gaze-like attention maps to be aligned with real human gaze attention. To guarantee the effectiveness of attribute feature learning, we further align the extracted attribute features with attribute-defined class embeddings. To facilitate learning from human gaze attention for visual recognition problems, we design a bird classification game to collect real human gaze data on the CUB dataset via an eye-tracker device. Experiments on ZSL and FGVC tasks without/with real human gaze data validate the benefits and accuracy of our proposed model.
This work supports the promising benefits of collecting human gaze datasets and automatic gaze estimation algorithms learning from human attention for high-level computer vision tasks.
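The abstract describes an attribute attention branch that matches learnable attribute prototypes against the backbone's spatial feature map to produce per-attribute attention maps and attention-pooled attribute features, with the pooled attention then aligned to a human gaze density map. The sketch below illustrates that idea in minimal numpy; the function names, the dot-product similarity, the softmax-over-locations normalization, and the MSE gaze-alignment surrogate are illustrative assumptions, not the paper's actual architecture or loss.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attribute_attention(feat_map, prototypes):
    """feat_map: (C, H, W) image features from the baseline network.
    prototypes: (K, C) one learnable prototype per attribute.
    Returns (K, H, W) attention maps and (K, C) attribute features."""
    C, H, W = feat_map.shape
    flat = feat_map.reshape(C, H * W)        # (C, HW)
    scores = prototypes @ flat               # (K, HW): similarity of each
                                             # prototype to each spatial location
    attn = softmax(scores, axis=-1)          # normalize over locations
    attr_feats = attn @ flat.T               # (K, C): attention-weighted pooling
    return attn.reshape(-1, H, W), attr_feats

def gaze_alignment_loss(attn_maps, gaze_map):
    """Align the mean attribute attention with a human gaze density map;
    both are renormalized to probability maps, compared here with a
    simple MSE surrogate (an assumed stand-in for the paper's loss)."""
    pred = attn_maps.mean(axis=0)
    pred = pred / pred.sum()
    gaze = gaze_map / gaze_map.sum()
    return float(((pred - gaze) ** 2).mean())

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))            # toy backbone features
protos = rng.normal(size=(5, 8))             # 5 toy attribute prototypes
attn, feats = attribute_attention(feat, protos)
loss = gaze_alignment_loss(attn, rng.random((4, 4)))
```

In this reading, each attribute feature is a spatial average of backbone features weighted by where its prototype "looks", so supervising the attention maps with gaze directly constrains attribute localization rather than only the pooled embeddings.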