Multi Spatial Relation Detection in Images

IF 1.6 · Zone 4 (Psychology) · Q3 · PSYCHOLOGY, EXPERIMENTAL

Spatial Cognition and Computation Pub Date : 2021-08-04 DOI:10.1080/13875868.2021.1957897

Brandon Birmingham, A. Muscat

Pages: 293-327


Abstract

Detecting spatial relationships between objects depicted in an image is an important sub-task in vision and language understanding. Its practical use lies in visual discourse, when referring to objects by their relationship to others in context, and it finds application in higher-level tasks such as visual question answering and image description generation. Presumably, the selection of spatial prepositions grounded in an image is straightforward. In general, however, human beings do not always agree with one another, or are not always consistent, when choosing spatial prepositions. This could be due to various reasons, such as near synonyms, overlapping terms and different frames of reference. For these reasons, the automatic detection of spatial relations is a non-trivial multi-label problem. This paper addresses the automatic multi-selection of prepositions. The study is based on the development of a number of machine learning models, namely Nearest Neighbor (NN), k-Means Clustering (kM-C), Agglomerative Hierarchical Clustering (A-HC) and Multi-label Neural Network (ML-NN). Model performance is compared quantitatively using multi-label metrics as well as human evaluations that are independent of the ground truth labels. Additionally, the classification results are used as the basis for an error and qualitative analysis that sheds light on how well each model deals with synonymous and overlapping relations, and that groups common errors to inform future directions. Furthermore, to gain insight into the merits of multi-label models, a single-label Random Forest (RF) classifier is developed and its results are included in the analysis.

Of all the multi-label models, the ML-NN exhibits the best overall performance when evaluated on both the dataset ground truth and the independent human evaluations. It does, however, under-generate prepositions, while the rest of the models often generate more prepositions at the expense of precision. The clustering-based methods are also not entirely consistent, although they do better than the other models on the less frequent spatial configurations that those models struggle with. The results from the single-label RF classifier highlight the usefulness of having a multi-label model. Finally, the error analysis indicates that the majority of errors are due to a lack of features that give cues on object position and orientation (object pose), the fixed frame of reference, and the failure to resolve depth in perspective view.
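The trade-off between under-generating and over-generating prepositions can be illustrated with example-based multi-label precision and recall. The sketch below is only an illustration: the abstract does not say which multi-label metrics the paper uses, and the function name `example_based_scores`, the preposition sets and the two hypothetical models (`under_generating`, `over_generating`) are assumptions made here, not data or code from the study.

```python
from typing import List, Set, Tuple


def example_based_scores(
    gold: List[Set[str]], predicted: List[Set[str]]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), each averaged over examples.

    Each example is a set of labels (here: spatial prepositions).
    An empty prediction or empty gold set scores 0.0 for the affected ratio.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("need at least one example")

    precisions, recalls, f1s = [], [], []
    for gold_set, pred_set in zip(gold, predicted):
        hits = len(gold_set & pred_set)  # prepositions both sets agree on
        precision = hits / len(pred_set) if pred_set else 0.0
        recall = hits / len(gold_set) if gold_set else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(gold)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical ground truth: several prepositions are acceptable per object pair.
gold_labels = [{"on", "above"}, {"next to", "near", "beside"}]

# Hypothetical model that outputs too few prepositions (high precision, low recall).
under_generating = [{"on"}, {"near"}]

# Hypothetical model that outputs too many prepositions (high recall, lower precision).
over_generating = [
    {"on", "above", "over", "near"},
    {"next to", "near", "beside", "behind"},
]

under_scores = example_based_scores(gold_labels, under_generating)
over_scores = example_based_scores(gold_labels, over_generating)

if __name__ == "__main__":
    print("under-generating (P, R, F1):", under_scores)
    print("over-generating  (P, R, F1):", over_scores)
```

In this toy setting the under-generating model never outputs a wrong preposition but misses acceptable alternatives, while the over-generating model recovers every acceptable preposition and pays for it in precision. This is the pattern the abstract reports for the ML-NN versus the remaining models.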

