Controllable diffusion models for hazardous construction site scene generation

IF 6.6, Zone 1 (Computer Science), JCR Q1: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Applied Soft Computing Pub Date : 2025-06-24 DOI:10.1016/j.asoc.2025.113446

XunLong Wang, JiangTao Ren

Volume 181, Article 113446. Full text: https://www.sciencedirect.com/science/article/pii/S1568494625007574

Citations: 0

Abstract

Hazardous scene recognition is critical for construction site safety, but such scenes occur rarely in real environments, so training data are insufficient and model development is limited. Hazardous scene generation helps address this data scarcity, but the scenes involve complex background-object relationships and large differences in object size. These characteristics make precise layout control difficult for general-purpose scene generation models. To address these challenges, we propose a novel framework for generating hazardous construction site scenes, based on large language and diffusion models and organized as a two-stage generation process. In the first stage, we fine-tune the large language model (LLM) through context learning to serve as a text-based layout generator. In the second stage, we introduce a novel text-to-image diffusion model to guide the image generation process, ensuring that the generated image adheres to the scene layout produced in the first stage. Additionally, we propose two key modules, the Layout Enhancement Module and the Scale Fusion Module, to improve image quality and layout adherence. Comparative experiments show that our method generates superior scenes with stronger controllability and higher quality. Testing on a real-world dataset achieved a mAP score of 38.1%, improving model accuracy by 20.4% AP, 17.5% $AP_{50}$, and 17.0% AR compared to models trained on real data, demonstrating our method’s effectiveness in hazardous construction site scene generation.
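The abstract does not specify the text format of the stage-one layout or how layout adherence is scored, so the following is only a minimal sketch under stated assumptions. It assumes a hypothetical layout format of one `label: [x0, y0, x1, y1]` line per object, with coordinates normalized to [0, 1], and shows how such text could be parsed into boxes and compared by intersection over union (IoU). IoU is the overlap measure behind $AP_{50}$, which by the usual detection convention counts a prediction as correct when its IoU with a ground-truth box is at least 0.5. The names `Box`, `parse_layout`, and `iou` are illustrative and do not come from the paper.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    label: str
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


# Assumed (hypothetical) layout line: "label: [x0, y0, x1, y1]", normalized coords.
_LINE = re.compile(r"^\s*([^:]+?)\s*:\s*\[([^\]]+)\]\s*$")


def parse_layout(text: str) -> List[Box]:
    """Parse LLM-style layout text into boxes, skipping malformed lines."""
    boxes = []
    for line in text.strip().splitlines():
        match = _LINE.match(line)
        if match is None:
            continue  # line does not follow the assumed format
        try:
            coords = [float(v) for v in match.group(2).split(",")]
        except ValueError:
            continue  # non-numeric coordinate
        if len(coords) != 4:
            continue
        x0, y0, x1, y1 = coords
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            continue  # degenerate or out-of-range box
        boxes.append(Box(match.group(1), x0, y0, x1, y1))
    return boxes


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = min(a.x1, b.x1) - max(a.x0, b.x0)
    ih = min(a.y1, b.y1) - max(a.y0, b.y0)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (a.area + b.area - inter)


if __name__ == "__main__":
    layout = "worker: [0.10, 0.40, 0.25, 0.90]\nexcavator: [0.45, 0.30, 0.95, 0.85]"
    for box in parse_layout(layout):
        print(box.label, round(box.area, 4))
```

In a pipeline of the kind the abstract describes, boxes like these would be handed to the layout-conditioned diffusion stage. The area printed for each box gives a rough sense of the size differences between small objects (a worker) and large ones (an excavator) that the abstract identifies as a difficulty.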


Source journal

Applied Soft Computing (Engineering and Technology: Computer Science, Interdisciplinary Applications)

CiteScore: 15.80

Self-citation rate: 6.90%

Articles published: 874

Review time: 10.9 months

About the journal: Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. The focus is to publish the highest quality research in the application and convergence of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets, and other similar techniques to address real-world complexities. Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. The website is therefore updated continuously with new articles, and the publication time is short.