Jiaxuan Wang, Chaoyi Wang, Yang Hua, Tao Song, Zhengui Xue, Ruhui Ma, Haibing Guan
Positional Mask Attention for Video Sequence Modeling
2021 14th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), 2021-10-23
DOI: 10.1109/CISP-BMEI53629.2021.9624361
Citation count: 0
Abstract
Attention mechanisms have been widely developed across domains. Some recent studies apply position embeddings to encode relative positions in the attention mechanism, learning better representations in both natural language processing and computer vision tasks. However, position embedding suffers from the “fixed input size” problem and requires substantial additional memory to store the position embedding parameters. In this paper, we present positional mask attention, a new approach to incorporating position information into the attention mechanism. Specifically, we propose a positional distance mask that encodes relative positions as a type of prior knowledge, in contrast to existing position embedding methods. To verify the generality and effectiveness of the proposed method, we evaluate positional mask attention on two general video understanding tasks: video object detection and video instance segmentation. Experimental results demonstrate that our method achieves significant improvement by aggregating position information.
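
The idea of a parameter-free positional distance mask can be illustrated with a small sketch. The abstract does not give the mask's functional form, so the Gaussian decay over the index distance |i - j|, the `sigma` parameter, and the function names below are all assumptions made for illustration, not the authors' formulation. The sketch only shows how a distance-based prior can be computed on the fly for any sequence length, without a stored embedding table.

```python
import numpy as np

def log_distance_mask(n, sigma=2.0):
    """Log of a hypothetical positional distance mask.

    Assumption: the mask decays as a Gaussian of the relative distance
    |i - j| between positions. The paper's actual mask may differ.
    Nothing here is learned, and n can be any sequence length.
    """
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    return -(dist.astype(float) ** 2) / (2.0 * sigma ** 2)

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def positional_mask_attention(q, k, v, sigma=2.0):
    """Scaled dot-product attention with the distance prior applied.

    Adding the log-mask to the logits is equivalent to multiplying the
    unnormalized attention weights by the mask and renormalizing.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    weights = softmax(scores + log_distance_mask(n, sigma), axis=-1)
    return weights @ v, weights

# Usage: 5 positions (e.g. frames), feature dimension 4.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 4))
out, weights = positional_mask_attention(q, k, v)
print(out.shape, weights.shape)
```

Because the mask is computed from indices rather than looked up in a table, the same function handles a different number of positions with no extra parameters, which is the property the abstract contrasts with position embeddings.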