Pro2Diff: Proposal Propagation for Multi-Object Tracking via the Diffusion Model

IF 13.7

IEEE transactions on image processing: a publication of the IEEE Signal Processing Society · Pub Date: 2024-11-14 · DOI: 10.1109/TIP.2024.3494600

Hongmin Liu;Canbin Zhang;Bin Fan;Jinglin Xu

Volume 33, pages 6508-6520 · Publication type: Journal Article · Not open access · https://ieeexplore.ieee.org/document/10753449/

Citations: 0

Abstract

Multi-object tracking (MOT) aims to estimate the bounding boxes and ID labels of objects in videos. A central challenge is alleviating the competitive learning between the detection and tracking subtasks. Two-stage Tracking-By-Detection (TBD) methods address it by optimizing the two subtasks individually, while single-stage Joint Detection and Tracking (JDT) methods finely adjust complex network architectures within an end-to-end pipeline. In this paper, we propose a new MOT method, Proposal Propagation via Diffusion Models (Pro2Diff), which integrates a diffusion model into proposal propagation for multi-object tracking and focuses on the model training process rather than on complex network design. Specifically, taking a generative approach, Pro2Diff generates a considerable number of noisy proposals for the tracking image sequence in the forward process. It then learns the discrepancies between these noisy proposals and the actual bounding boxes of the tracked objects, gradually refining the noisy proposals to obtain the initial sequence of real tracked objects. By introducing the denoising diffusion process into multi-object tracking, we make three further findings: 1) generative methods can effectively handle multi-object tracking tasks; 2) self-conditional proposal propagation, which we propose, effectively enhances model performance during inference without modifying the model structure; 3) adjusting the numbers of proposals and iterations appropriately for different tracking sequences yields the model's optimal performance. Extensive experiments on the MOT17 and DanceTrack datasets demonstrate that Pro2Diff outperforms current end-to-end multi-object tracking methods. We achieve 61.9 HOTA on DanceTrack and 57.6 HOTA on MOT17, results that are competitive with JDT approaches.
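
The forward process mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the noise schedule, the box parameterization, or how proposals are padded, so everything below is an assumption: it uses the standard DDPM forward step x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, a cosine schedule, boxes as normalized (cx, cy, w, h), and uniform random padding. The function names `cosine_alpha_bar` and `noisy_proposals` are hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal rate alpha_bar_t for t = 0..num_steps (cosine schedule, assumed)."""
    t = np.linspace(0, num_steps, num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]  # normalized so that alpha_bar_0 == 1


def noisy_proposals(gt_boxes, num_proposals, t, alpha_bar, rng):
    """Pad ground-truth boxes to a fixed count, then apply one DDPM forward step.

    gt_boxes: array of shape (n, 4), normalized (cx, cy, w, h) -- an assumed layout.
    Returns the noisy proposals x_t and the noise eps that was added.
    """
    gt_boxes = np.asarray(gt_boxes, dtype=np.float64).reshape(-1, 4)
    n_pad = num_proposals - len(gt_boxes)
    if n_pad < 0:
        raise ValueError("num_proposals must be >= number of ground-truth boxes")
    # Hypothetical padding choice: fill the remaining slots with uniform random boxes.
    padding = rng.uniform(0.0, 1.0, size=(n_pad, 4))
    x0 = np.concatenate([gt_boxes, padding], axis=0)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = cosine_alpha_bar(1000)
    gt = np.array([[0.5, 0.5, 0.2, 0.3], [0.3, 0.6, 0.1, 0.1]])
    x_t, eps = noisy_proposals(gt, num_proposals=8, t=500, alpha_bar=alpha_bar, rng=rng)
    print(x_t.shape)
```

Under these assumptions, a denoising network would be trained to recover the original boxes from `x_t`, and the number of proposals and the number of denoising iterations would remain free parameters at inference time, consistent with the third finding in the abstract.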

Source journal

IEEE transactions on image processing: a publication of the IEEE Signal Processing Society

Self-citation rate

0.00%

Publication volume