Ensemble Long Short-Term Tracking with ConvNeXt and Transformer
Authors: Yuhua Xiao, Yifeng Zhang, Pengyu Ni
Venue: 2022 7th International Conference on Image, Vision and Computing (ICIVC)
Published: 2022-07-26
DOI: 10.1109/ICIVC55077.2022.9887117 (https://doi.org/10.1109/ICIVC55077.2022.9887117)
Visual object tracking is an important research topic in computer vision. The widely used Siamese network architecture learns a similarity metric between the target object and a search region, and uses it to locate the target in each frame of a video sequence. In this paper, we present an ensemble long short-term tracking algorithm based on ConvNeXt and Transformer. First, a Siamese network with a ConvNeXt backbone extracts features from both the target and the search region. Second, an encoder-decoder transformer captures global feature dependencies. In addition, an IoU-confidence-based tracking ensemble algorithm is designed to capture both the long-term stable appearance and the short-term variable appearance of the target. The proposed tracker, called STARK-NeXt, achieves a success rate of 68.9% on LaSOT, outperforming STARK by 1.8 percentage points.
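The abstract does not spell out how the IoU-confidence-based ensemble combines its long-term and short-term predictions, so the following is only a minimal sketch of one plausible rule, not the authors' method. The function names `iou` and `fuse`, the `(x1, y1, x2, y2)` box layout, and the `iou_thresh` value are all assumptions introduced for illustration.

```python
from typing import Tuple

# Assumed box layout: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse(long_box: Box, long_conf: float,
         short_box: Box, short_conf: float,
         iou_thresh: float = 0.5) -> Box:
    """Hypothetical fusion rule (an assumption, not taken from the paper).

    If the long-term and short-term boxes agree (IoU >= iou_thresh),
    return their confidence-weighted average. Otherwise return the box
    of the more confident branch.
    """
    if iou(long_box, short_box) >= iou_thresh:
        total = long_conf + short_conf
        if total <= 0:
            return long_box
        w = long_conf / total
        return tuple(w * l + (1 - w) * s
                     for l, s in zip(long_box, short_box))
    return long_box if long_conf >= short_conf else short_box
```

For example, `fuse((0, 0, 10, 10), 0.9, (50, 50, 60, 60), 0.3)` returns the long-term box, because the two predictions do not overlap and the long-term branch is more confident. A real tracker would also need a policy for updating the short-term template, which the abstract does not describe.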