Visual Spatial Attention Network for Relationship Detection
Chaojun Han, Fumin Shen, Li Liu, Yang Yang, Heng Tao Shen
Proceedings of the 26th ACM International Conference on Multimedia (MM '18), 15 October 2018. DOI: 10.1145/3240508.3240611
Citations: 29
Abstract
Visual relationship detection, which aims to predict a ⟨subject, predicate, object⟩ triplet from the detected objects, has attracted increasing attention in scene understanding research. In tackling this problem, handling the varying scales of subjects and objects is of great importance, yet this issue has received comparatively little study. To overcome this challenge, we propose a novel Visual Spatial Attention Network (VSA-Net), which employs a two-dimensional normal-distribution attention scheme to effectively model small objects. In addition, we design a Subject-Object layer (SO-layer) to distinguish between the subject and the object, yielding more precise results. To the best of our knowledge, VSA-Net is the first end-to-end, attention-based visual relationship detection model. Extensive experiments on the benchmark datasets (VRD and VG) show that, using pure visual information, our VSA-Net achieves state-of-the-art performance on predicate detection, phrase detection, and relationship detection.
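The abstract does not spell out how the two-dimensional normal-distribution attention is parameterized, but the core idea can be illustrated with a minimal sketch. The following is our own assumption-laden illustration, not the authors' implementation: the function name `gaussian_attention_mask`, the choice of centering the Gaussian on a detected box, and the box-size-proportional standard deviations are all hypothetical. The point it demonstrates is that a Gaussian mask tied to a small object's box concentrates spatial weight on that object when reweighting the feature map.

```python
import numpy as np

def gaussian_attention_mask(h, w, box):
    """Build a 2D normal-distribution attention mask over an h x w feature map.

    `box` is (x1, y1, x2, y2) in feature-map coordinates. The Gaussian is
    centered on the box, and its spread follows the box size, so a small
    object yields a tight, concentrated mask. (Illustrative parameterization,
    not the paper's exact formulation.)
    """
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    # Standard deviations proportional to box size; small boxes -> focused attention.
    sx = max((box[2] - box[0]) / 2.0, 1.0)
    sy = max((box[3] - box[1]) / 2.0, 1.0)
    ys, xs = np.mgrid[0:h, 0:w]
    mask = np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2)
                    + ((ys - cy) ** 2) / (2 * sy ** 2)))
    return mask / mask.max()  # normalize the peak to 1

# Reweight a C x H x W feature map with an object-centered mask.
feat = np.random.rand(256, 32, 32).astype(np.float32)
mask = gaussian_attention_mask(32, 32, box=(20, 22, 26, 28))  # a small object
attended = feat * mask[None, :, :]  # broadcast the mask across channels
```

In this sketch, a compact box produces a sharply peaked mask, so features belonging to a small subject or object dominate the attended representation rather than being diluted by background activations.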