EGSA: Enhanced and Global Semantic Activation for Weakly Supervised Object Localization
Yin Liu, Lingyun Wang, Xin Xu, Xiaopeng Luo
2022 IEEE International Conference on Networking, Sensing and Control (ICNSC), 2022-12-15
DOI: 10.1109/ICNSC55942.2022.10004147
Weakly supervised object localization (WSOL) is the task of locating objects using only image-level supervision. Traditional CNN-based methods tend to activate only the most discriminative regions of an object and struggle to balance classification and localization accuracy. To address this problem, we propose an enhanced and global semantic activation (EGSA) method built on the vision transformer. We first use an attention reassignment module to obtain a comprehensive attention map that captures both the correlation between image patches and the global dependency of the class token. We then propose a mask selection module that generates a mask map by thresholding, in order to extract the token feature map of the non-discriminative object regions. By coupling these two maps and combining them with a semantics-aware map that carries the class-token information, we build the final localization map with enhanced and global semantic activation. Experiments on two common benchmark datasets, CUB-200-2011 and ILSVRC, demonstrate the effectiveness of our method.
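The pipeline the abstract outlines (attention reassignment, threshold-based mask selection, and map coupling) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name, tensor shapes, the simple matrix-product reassignment, and the default threshold are all assumptions for clarity.

```python
import numpy as np

def egsa_localization_map(cls_attn, patch_attn, semantic_map, mask_thresh=0.5):
    """Hypothetical sketch of building a localization map from ViT attention.

    cls_attn:      (N,)   attention from the class token to each of N patches
    patch_attn:    (N, N) patch-to-patch attention (inter-patch correlation)
    semantic_map:  (N,)   per-patch semantic activation carrying class-token info
    """
    # Attention reassignment (assumed form): propagate class-token attention
    # through patch-to-patch correlations to get a comprehensive attention map
    # that mixes global class-token dependency with local patch correlation.
    comprehensive = patch_attn @ cls_attn
    comprehensive = comprehensive / (comprehensive.max() + 1e-8)  # scale to [0, 1]

    # Mask selection: compare against a threshold to obtain a binary mask map
    # covering object regions beyond the most discriminative ones.
    mask = (comprehensive >= mask_thresh).astype(np.float32)

    # Couple the comprehensive map with the mask, then combine with the
    # semantics-aware map to form the final localization map.
    loc_map = comprehensive * mask * semantic_map
    return loc_map / (loc_map.max() + 1e-8)
```

In practice the per-patch map would be reshaped to the patch grid (e.g. 14×14 for a 224×224 image with 16×16 patches) and upsampled to image resolution before extracting a bounding box.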