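Automatic perception of potential safety hazards: A cross-modal multi-task framework for feature alignment, image classification and captioning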

IF 9.9 | CAS Zone 1, Engineering Technology | JCR Q1, Computer Science, Artificial Intelligence

Advanced Engineering Informatics | Pub Date: 2025-10-04 | DOI: 10.1016/j.aei.2025.103919

Yanjun Guo, Xinbo Ai, Mingxiu Guo, Shaoyang Cheng

Volume 69, Article 103919 | Journal Article | https://www.sciencedirect.com/science/article/pii/S1474034625008122

Citations: 0

Abstract

Automatic perception of potential safety hazards (PSHs) is critical for ensuring workplace safety and protecting property against significant threats. PSH perception involves determining whether hazards exist, capturing on-site images, and completing inspection reports, all of which are critical for mitigating these risks. Although computer vision techniques such as image classification and image captioning offer promising alternatives for PSH perception, comprehensive hazard perception requires not only hazard identification but also comprehension of the semantic relationships among scene entities in order to formulate descriptive safety reports. To address the multifaceted nature of PSH perception, this study proposes a cross-modal multi-task learning (MTL) method named Hazard-MTL, which jointly optimizes three synergistic tasks: feature alignment (image–text), binary image classification, and image captioning. Specifically, the approach employs a scene graph-guided chain-of-thought data augmentation method that integrates knowledge prompts and multi-task contextual reasoning to produce semantically coherent and informationally complete risk descriptions. To improve model robustness, a bidirectional contrastive loss is designed to suppress irrelevant cross-modal similarities. Additionally, a dynamic joint training strategy is introduced that combines progressive teacher forcing with adaptive loss weighting to achieve harmonized multi-task optimization. The model outperforms single-task baselines, with a 72.7% F1 score (+10.7%) for PSH classification and 1.575 CIDEr (+0.619) for description generation. Hazard-MTL advances holistic scene understanding by integrating MTL, offering a safer automated solution for enterprise and construction safety management.
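To make the feature-alignment task more concrete, the following is a minimal sketch of a generic symmetric (image-to-text and text-to-image) contrastive loss in the InfoNCE style. It is not the paper's bidirectional contrastive loss, whose exact formulation the abstract does not give. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions chosen for illustration, and embeddings are assumed to be non-zero vectors.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image contrastive losses.

    Row i of image_emb is assumed to be paired with row i of text_emb.
    The temperature is a hypothetical default, not a value from the paper.
    """
    images = [_normalize(v) for v in image_emb]
    texts = [_normalize(v) for v in text_emb]
    # Cosine similarity between every image and every text, scaled.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    loss_i2t = _diagonal_cross_entropy(logits)
    loss_t2i = _diagonal_cross_entropy(transposed)
    return 0.5 * (loss_i2t + loss_t2i)


# Toy embeddings: each text vector points roughly along its paired image.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_texts = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_texts = aligned_texts[1:] + aligned_texts[:1]  # break the pairing

aligned_loss = symmetric_contrastive_loss(images, aligned_texts)
shuffled_loss = symmetric_contrastive_loss(images, shuffled_texts)
print(aligned_loss < shuffled_loss)
```

In this sketch, correctly paired image and text embeddings yield a lower loss than mismatched pairs, which is the behavior a feature-alignment objective is meant to reward. In a multi-task setup like the one described, such a term would be one of several losses combined with task weights.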

Source journal: Advanced Engineering Informatics (Engineering Technology; Engineering, Multidisciplinary)

CiteScore: 12.40

Self-citation rate: 18.20%

Articles published: 292

Review time: 45 days

Journal description: Advanced Engineering Informatics is an international journal that solicits research papers with an emphasis on 'knowledge' and 'engineering applications'. The journal seeks original papers that report progress in applying methods of engineering informatics. These papers should have engineering relevance and help provide a scientific base for more reliable, spontaneous, and creative engineering decision-making. Additionally, papers should demonstrate the science of supporting knowledge-intensive engineering tasks and validate the generality, power, and scalability of new methods through rigorous evaluation, preferably both qualitative and quantitative. Abstracting and indexing for Advanced Engineering Informatics include Science Citation Index Expanded, Scopus and INSPEC.