Dinh Thang Hoang, Trung Kien Thai, Thanh Nguyen Chi, Long Quoc Tran
Published in: 2021 8th NAFOSTED Conference on Information and Computer Science (NICS), 2021-12-21. DOI: 10.1109/NICS54270.2021.9701569 (https://doi.org/10.1109/NICS54270.2021.9701569)
Real-Time Siamese Visual Tracking with Lightweight Transformer
Siamese-network-based trackers have demonstrated remarkable performance in visual tracking. Most existing trackers compute the target template and search image features independently, then use cross-correlation to predict the likelihood of the object appearing at each spatial position in the search image for target localization. This paper proposes a Siamese network that enhances and aggregates features between the target template and the search image using a lightweight transformer with several linear self- and cross-attention layers. With an anchor-free prediction head, the proposed framework is simple and effective. Extensive experiments on visual tracking benchmarks such as VOT2018, UAV123, and OTB100 demonstrate that our tracker achieves state-of-the-art performance and operates at a real-time frame rate of 39 fps.
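
To make the idea of linear self- and cross-attention between template and search features more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not say which linear attention variant is used; this sketch assumes the common kernel formulation with an `elu(x) + 1` feature map. The names `feature_map`, `linear_attention`, `template_tokens`, and `search_tokens`, as well as the token counts and feature dimension, are hypothetical and chosen only for illustration.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: a positive feature map (assumed here; the abstract names none).
    # np.minimum keeps exp() from overflowing on the branch that is discarded.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(queries, keys, values, eps=1e-6):
    """Kernelized attention with cost linear in the number of tokens.

    queries: (n_q, d), keys: (n_k, d), values: (n_k, d_v)
    """
    q = feature_map(queries)
    k = feature_map(keys)
    kv = k.T @ values            # (d, d_v): keys and values summarized once
    norm = q @ k.sum(axis=0)     # (n_q,): per-query normalizer
    return (q @ kv) / (norm[:, None] + eps)


rng = np.random.default_rng(0)
# Hypothetical sizes: an 8x8 template and a 16x16 search feature map, 32 channels.
template_tokens = rng.standard_normal((64, 32))
search_tokens = rng.standard_normal((256, 32))

# Self-attention enhances the search features using the search features themselves.
enhanced = linear_attention(search_tokens, search_tokens, search_tokens)
# Cross-attention aggregates template information into each search position.
fused = linear_attention(search_tokens, template_tokens, template_tokens)
print(enhanced.shape, fused.shape)
```

Because `k.T @ values` is computed once and reused for every query, the cost grows linearly with the number of search positions rather than quadratically, which is the property that makes this kind of layer attractive for real-time tracking. A real tracker would add learned projections, multiple heads, and a prediction head on top of `fused`.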