{"title":"ROMOT:参照-表达-理解开放集多目标跟踪","authors":"Wei Li, Bowen Li, Jingqi Wang, Weiliang Meng, Jiguang Zhang, Xiaopeng Zhang","doi":"10.1007/s00371-024-03544-7","DOIUrl":null,"url":null,"abstract":"<p>Traditional multi-object tracking is limited to tracking a predefined set of categories, whereas open-vocabulary tracking expands its capabilities to track novel categories. In this paper, we propose ROMOT (referring-expression-comprehension open-set multi-object tracking), which not only tracks objects from novel categories not included in the training data, but also enables tracking based on referring expression comprehension (REC). REC describes targets solely by their attributes, such as “the person running at the front” or “the bird flying in the air rather than on the ground,” making it particularly relevant for real-world multi-object tracking scenarios. Our ROMOT achieves this by harnessing the exceptional capabilities of a visual language model and employing multi-stage cross-modal attention to handle tracking for novel categories and REC tasks. Integrating RSM (reconstruction similarity metric) and OCM (observation-centric momentum) in our ROMOT eliminates the need for task-specific training, addressing the challenge of insufficient datasets. Our ROMOT enhances efficiency and adaptability in handling tracking requirements without relying on extensive tracking training data.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"40 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ROMOT: Referring-expression-comprehension open-set multi-object tracking\",\"authors\":\"Wei Li, Bowen Li, Jingqi Wang, Weiliang Meng, Jiguang Zhang, Xiaopeng Zhang\",\"doi\":\"10.1007/s00371-024-03544-7\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Traditional multi-object tracking is limited to tracking a predefined set of categories, whereas open-vocabulary tracking expands its capabilities to track novel categories. In this paper, we propose ROMOT (referring-expression-comprehension open-set multi-object tracking), which not only tracks objects from novel categories not included in the training data, but also enables tracking based on referring expression comprehension (REC). REC describes targets solely by their attributes, such as “the person running at the front” or “the bird flying in the air rather than on the ground,” making it particularly relevant for real-world multi-object tracking scenarios. Our ROMOT achieves this by harnessing the exceptional capabilities of a visual language model and employing multi-stage cross-modal attention to handle tracking for novel categories and REC tasks. Integrating RSM (reconstruction similarity metric) and OCM (observation-centric momentum) in our ROMOT eliminates the need for task-specific training, addressing the challenge of insufficient datasets. Our ROMOT enhances efficiency and adaptability in handling tracking requirements without relying on extensive tracking training data.</p>\",\"PeriodicalId\":501186,\"journal\":{\"name\":\"The Visual Computer\",\"volume\":\"40 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-06-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"The Visual Computer\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s00371-024-03544-7\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Visual Computer","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s00371-024-03544-7","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Traditional multi-object tracking is limited to tracking a predefined set of categories, whereas open-vocabulary tracking expands its capabilities to track novel categories. In this paper, we propose ROMOT (referring-expression-comprehension open-set multi-object tracking), which not only tracks objects from novel categories not included in the training data, but also enables tracking based on referring expression comprehension (REC). REC describes targets solely by their attributes, such as “the person running at the front” or “the bird flying in the air rather than on the ground,” making it particularly relevant for real-world multi-object tracking scenarios. Our ROMOT achieves this by harnessing the exceptional capabilities of a visual language model and employing multi-stage cross-modal attention to handle tracking for novel categories and REC tasks. Integrating RSM (reconstruction similarity metric) and OCM (observation-centric momentum) in our ROMOT eliminates the need for task-specific training, addressing the challenge of insufficient datasets. Our ROMOT enhances efficiency and adaptability in handling tracking requirements without relying on extensive tracking training data.