{"title":"RETR:端到端引用表达式理解与变形","authors":"Yang Rui","doi":"10.1109/ICCWAMTIP56608.2022.10016599","DOIUrl":null,"url":null,"abstract":"Referring Expression Comprehension (REC) is a basic and challenging task to identify the referred region given a language expression. However, existing two-stage or one-stage methods suffer from the region proposals, the limited range of visual context and the incomplete cross-modal alignment. To address these problems, we propose a simple yet effective one-stage model, termed REC TRansformer (RETR), which is trained end-to-end. Different from the manually designed multi-modal fusion, RETR adopts a transformer decoder with alternately stacked self-attention and cross-attention layers to capture the global visual context and establish the detailed visual-linguistic correspondence. Moreover, we utilize multiple learnable tokens to obtain diverse yet complementary region representations to give the accurate prediction. Extensive experiments are conducted on four datasets and RETR achieves the state-of-the-art performance.","PeriodicalId":159508,"journal":{"name":"2022 19th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"RETR: End-To-End Referring Expression Comprehension with Transformers\",\"authors\":\"Yang Rui\",\"doi\":\"10.1109/ICCWAMTIP56608.2022.10016599\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Referring Expression Comprehension (REC) is a basic and challenging task to identify the referred region given a language expression. However, existing two-stage or one-stage methods suffer from the region proposals, the limited range of visual context and the incomplete cross-modal alignment. To address these problems, we propose a simple yet effective one-stage model, termed REC TRansformer (RETR), which is trained end-to-end. Different from the manually designed multi-modal fusion, RETR adopts a transformer decoder with alternately stacked self-attention and cross-attention layers to capture the global visual context and establish the detailed visual-linguistic correspondence. Moreover, we utilize multiple learnable tokens to obtain diverse yet complementary region representations to give the accurate prediction. 
Extensive experiments are conducted on four datasets and RETR achieves the state-of-the-art performance.\",\"PeriodicalId\":159508,\"journal\":{\"name\":\"2022 19th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP)\",\"volume\":\"19 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 19th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCWAMTIP56608.2022.10016599\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 19th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCWAMTIP56608.2022.10016599","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract: Referring Expression Comprehension (REC) is a fundamental yet challenging task: given a natural-language expression, a model must localize the image region the expression refers to. However, existing two-stage and one-stage methods suffer from reliance on pre-computed region proposals, a limited range of visual context, and incomplete cross-modal alignment. To address these problems, we propose a simple yet effective one-stage model, termed REC TRansformer (RETR), which is trained end-to-end. Unlike manually designed multi-modal fusion schemes, RETR adopts a transformer decoder with alternately stacked self-attention and cross-attention layers to capture the global visual context and establish detailed visual-linguistic correspondence. Moreover, we utilize multiple learnable tokens to obtain diverse yet complementary region representations for accurate prediction. Extensive experiments on four datasets show that RETR achieves state-of-the-art performance.
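To make the described architecture concrete, below is a minimal sketch of the decoder the abstract outlines: a set of learnable region tokens refined by alternately stacked self-attention and cross-attention layers over fused visual-linguistic features. This is an illustrative reconstruction in PyTorch, not the authors' implementation; the model dimensions, layer count, number of tokens, and the box-regression head are all assumptions.

```python
# Illustrative sketch only: reconstructs the decoder described in the abstract
# (alternating self-/cross-attention over learnable region tokens). All
# hyperparameters and the box head are assumptions, not the paper's values.
import torch
import torch.nn as nn

class RETRDecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tokens, memory):
        # tokens: (B, K, d) learnable region queries
        # memory: (B, N, d) fused visual-linguistic features
        t = self.norm1(tokens + self.self_attn(tokens, tokens, tokens)[0])
        t = self.norm2(t + self.cross_attn(t, memory, memory)[0])
        return self.norm3(t + self.ffn(t))

class RETRDecoder(nn.Module):
    def __init__(self, num_tokens=4, num_layers=6, d_model=256):
        super().__init__()
        # Multiple learnable tokens, each intended to yield a diverse yet
        # complementary region representation.
        self.region_tokens = nn.Parameter(torch.randn(num_tokens, d_model))
        self.layers = nn.ModuleList(
            RETRDecoderLayer(d_model) for _ in range(num_layers)
        )
        # Hypothetical head regressing one box (cx, cy, w, h) per token.
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, memory):
        tokens = self.region_tokens.unsqueeze(0).expand(memory.size(0), -1, -1)
        for layer in self.layers:
            tokens = layer(tokens, memory)
        return self.box_head(tokens)  # (B, num_tokens, 4) candidate boxes

# Usage with dummy fused features: batch of 2, 196 feature tokens, dim 256.
decoder = RETRDecoder()
boxes = decoder(torch.randn(2, 196, 256))
print(boxes.shape)  # torch.Size([2, 4, 4])
```

In this reading, the self-attention step lets the region tokens coordinate with one another (encouraging complementary predictions), while the cross-attention step grounds each token in the full visual-linguistic context rather than in pre-computed region proposals.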