Object Detection Data Synthesis via Box-to-Image Generation based on Diffusion Models.

IF 18.6 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Pattern Analysis and Machine Intelligence Pub Date : 2025-09-15 DOI:10.1109/tpami.2025.3609962

Jingyuan Zhu,Huimin Ma,Jiansheng Chen,Jian Yuan

{"title":"Object Detection Data Synthesis via Box-to-Image Generation based on Diffusion Models.","authors":"Jingyuan Zhu,Huimin Ma,Jiansheng Chen,Jian Yuan","doi":"10.1109/tpami.2025.3609962","DOIUrl":null,"url":null,"abstract":"Modern diffusion-based image generative models have made significant progress and become promising to enrich training data for the object detection task. However, the generation quality and the controllability for complex scenes containing multi-class objects and dense objects with occlusions remain limited. This paper presents ODGEN, a novel method to generate high-quality images conditioned on bounding boxes, thereby facilitating data synthesis for object detection. Given a domain-specific object detection dataset, we first fine-tune a pre-trained diffusion model on both cropped foreground objects and entire images to fit target distributions. Then we propose to control the diffusion model using synthesized visual prompts with spatial constraints and object-wise textual descriptions. ODGEN exhibits robustness in handling complex scenes and specific domains. Further, we design a dataset synthesis pipeline to evaluate ODGEN on 7 domain-specific benchmarks to demonstrate its effectiveness. Adding training data generated by ODGEN improves up to 25.3% mAP@.50:.95 with object detectors like YOLOv5 and YOLOv7, outperforming prior controllable generative methods. We also design an evaluation protocol based on COCO-2014 to validate the synthetic data of ODGEN in general domains and observe an advantage up to 5.6% in mAP@.50:.95 against existing methods. In addition, we employ a series of large-scale object detection datasets to train a general model named Stable Box Diffusion, which covers thousands of object categories in most common scenes.","PeriodicalId":13426,"journal":{"name":"IEEE Transactions on Pattern Analysis and Machine Intelligence","volume":"24 1","pages":""},"PeriodicalIF":18.6000,"publicationDate":"2025-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Pattern Analysis and Machine Intelligence","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1109/tpami.2025.3609962","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Modern diffusion-based image generative models have made significant progress and become promising to enrich training data for the object detection task. However, the generation quality and the controllability for complex scenes containing multi-class objects and dense objects with occlusions remain limited. This paper presents ODGEN, a novel method to generate high-quality images conditioned on bounding boxes, thereby facilitating data synthesis for object detection. Given a domain-specific object detection dataset, we first fine-tune a pre-trained diffusion model on both cropped foreground objects and entire images to fit target distributions. Then we propose to control the diffusion model using synthesized visual prompts with spatial constraints and object-wise textual descriptions. ODGEN exhibits robustness in handling complex scenes and specific domains. Further, we design a dataset synthesis pipeline to evaluate ODGEN on 7 domain-specific benchmarks to demonstrate its effectiveness. Adding training data generated by ODGEN improves up to 25.3% mAP@.50:.95 with object detectors like YOLOv5 and YOLOv7, outperforming prior controllable generative methods. We also design an evaluation protocol based on COCO-2014 to validate the synthetic data of ODGEN in general domains and observe an advantage up to 5.6% in mAP@.50:.95 against existing methods. In addition, we employ a series of large-scale object detection datasets to train a general model named Stable Box Diffusion, which covers thousands of object categories in most common scenes.

查看原文本刊更多论文

基于扩散模型的盒像生成目标检测数据综合。

现代基于扩散的图像生成模型已经取得了重大进展，有望丰富目标检测任务的训练数据。然而，对于包含多类物体和密集物体遮挡的复杂场景，生成质量和可控性仍然有限。本文提出了一种以边界框为条件生成高质量图像的新方法ODGEN，从而促进了目标检测的数据合成。给定特定领域的目标检测数据集，我们首先在裁剪的前景对象和整个图像上微调预训练的扩散模型以拟合目标分布。然后，我们提出使用具有空间约束和对象智能文本描述的综合视觉提示来控制扩散模型。ODGEN在处理复杂场景和特定领域方面表现出鲁棒性。此外，我们设计了一个数据集合成管道，在7个特定领域的基准上评估ODGEN，以证明其有效性。添加由ODGEN生成的训练数据提高了25.3% mAP@.50：。使用YOLOv5和YOLOv7等目标检测器，优于先前的可控生成方法。我们还设计了基于COCO-2014的评估方案，验证了ODGEN在一般领域的合成数据，并在mAP@.50：中观察到高达5.6%的优势。反对现有的方法。此外，我们使用一系列大规模的目标检测数据集来训练一个名为“稳定盒扩散”的通用模型，该模型涵盖了大多数常见场景中的数千个目标类别。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Pattern Analysis and Machine Intelligence 工程技术-工程：电子与电气

CiteScore

28.40

自引率

3.00%

发文量

885

审稿时长

8.5 months

期刊介绍： The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition and relevant specialized hardware and/or software architectures are also covered.