Scene Grammar in Human and Machine Recognition of Objects and Scenes

Akram Bayat, D. Koh, Anubhaw Kumar Nand, Marta Pereira, M. Pomplun
{"title":"Scene Grammar in Human and Machine Recognition of Objects and Scenes","authors":"Akram Bayat, D. Koh, Anubhaw Kumar Nand, Marta Pereira, M. Pomplun","doi":"10.1109/CVPRW.2018.00268","DOIUrl":null,"url":null,"abstract":"In this paper, we study the effects of violating the high level scene syntactic and semantic rules on human eye-movement behavior and deep neural scene and object recognition networks. An eye-movement experimental study was conducted with twenty human subjects to view scenes from the SCEGRAM image database and determine whether there is an inconsistent object or not. We examine the contribution of multiple types of features that influence eye movements while searching for an inconsistent object in a scene (e.g., size and location of an object) by evaluating the consistency prediction power of the trained classifiers on fixation features. The results of the eye movement analysis and inconsistency prediction reveal that: 1) inconsistent objects are fixated significantly more than consistent objects in a scene, 2) the distribution of fixations is the main factor that is influenced by the inconsistency condition of a scene which is reflected in the ground truth fixation maps. It is also observed that the performance of deep object and scene recognition networks drops due to the violations of scene grammar. The class-specific visual saliency maps are created from the high-level representation of the convolutional layers of a deep network during the scene and object recognition process. We discuss whether the scene inconsistencies are represented in those saliency maps by evaluating their prediction powers using multiple well-known metrics including AUC, SIM, and KL. The results suggest that an inconsistent object in a scene causes significant variations in the prediction power of saliency maps.","PeriodicalId":150600,"journal":{"name":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","volume":"88 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPRW.2018.00268","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

In this paper, we study the effects of violating high-level scene syntactic and semantic rules on human eye-movement behavior and on deep neural networks for scene and object recognition. In an eye-movement study, twenty human subjects viewed scenes from the SCEGRAM image database and judged whether each scene contained an inconsistent object. We examine the contribution of multiple feature types that influence eye movements during the search for an inconsistent object in a scene (e.g., the size and location of an object) by evaluating how well classifiers trained on fixation features predict scene consistency. The eye-movement analysis and inconsistency prediction reveal that: 1) inconsistent objects are fixated significantly more often than consistent objects in a scene, and 2) the distribution of fixations, as reflected in the ground-truth fixation maps, is the factor most strongly influenced by a scene's inconsistency condition. We also observe that the performance of deep object and scene recognition networks drops when scene grammar is violated. Class-specific visual saliency maps are created from the high-level representations in the convolutional layers of a deep network during scene and object recognition. We investigate whether scene inconsistencies are represented in these saliency maps by evaluating their fixation prediction power with several well-known metrics, including AUC, SIM, and KL divergence. The results suggest that an inconsistent object in a scene causes significant variations in the prediction power of the saliency maps.
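The abstract's consistency-prediction step can be illustrated with a short sketch. The feature names, the simulated data, and the choice of classifier below are assumptions for illustration only; the underlying idea is to train a standard classifier on per-trial fixation features and measure how well it separates consistent from inconsistent scenes.

```python
# Minimal sketch, assuming per-trial fixation features; the feature set,
# simulated data, and classifier choice are illustrative, not the paper's
# exact pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical design matrix: one row per trial with
# [fixation_count, mean_fixation_duration_ms, fixation_dispersion_px].
X = rng.normal(loc=[12.0, 250.0, 80.0], scale=[3.0, 40.0, 15.0], size=(200, 3))
# Hypothetical labels: 1 = scene contains an inconsistent object, 0 = consistent.
y = rng.integers(0, 2, size=200)

# Cross-validated accuracy estimates how much consistency information
# the fixation features carry; real fixation data would replace X and y.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```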
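The class-specific saliency maps are described as coming from high-level convolutional representations, but the abstract does not specify the exact procedure. The sketch below uses a Grad-CAM-style computation on a torchvision ResNet as one plausible stand-in; the backbone, target layer, and input tensor are all assumptions.

```python
# Grad-CAM-style sketch of a class-specific saliency map; the backbone,
# hooked layer, and input are illustrative stand-ins, not the paper's setup.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["feat"] = out

def bwd_hook(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0]

# Hook the last convolutional block (a high-level representation).
model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

img = torch.randn(1, 3, 224, 224)            # placeholder image tensor
logits = model(img)
target_class = logits.argmax(dim=1).item()
logits[0, target_class].backward()           # gradient w.r.t. the class score

# Weight each feature map by its spatially averaged gradient, then combine.
weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=img.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```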
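The evaluation metrics named in the abstract (AUC, SIM, and KL) are standard in fixation-prediction benchmarking. A minimal sketch of SIM (histogram intersection of the two normalized maps) and KL divergence, assuming both maps are non-negative 2-D arrays of the same shape; AUC variants additionally require binary fixation locations and are omitted here:

```python
# Minimal sketch of two saliency metrics from the abstract (SIM and KL);
# both maps are assumed to be non-negative 2-D arrays of the same shape.
import numpy as np

def _normalize(m):
    m = m.astype(np.float64)
    s = m.sum()
    return m / s if s > 0 else m

def sim(saliency, fixation_map):
    """Similarity: sum of elementwise minima of the two distributions (1 = identical)."""
    p, q = _normalize(saliency), _normalize(fixation_map)
    return np.minimum(p, q).sum()

def kl_div(saliency, fixation_map, eps=1e-12):
    """KL divergence of the predicted map from the fixation map (0 = identical)."""
    p, q = _normalize(fixation_map), _normalize(saliency)
    return np.sum(p * np.log(eps + p / (q + eps)))

pred = np.random.rand(48, 64)   # placeholder predicted saliency map
gt = np.random.rand(48, 64)     # placeholder ground-truth fixation map
print(f"SIM = {sim(pred, gt):.3f}, KL = {kl_div(pred, gt):.3f}")
```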