Automatic perception of potential safety hazards: A cross-modal multi-task framework for feature alignment, image classification and captioning
Authors: Yanjun Guo, Xinbo Ai, Mingxiu Guo, Shaoyang Cheng
Journal: Advanced Engineering Informatics, vol. 69, Article 103919 (published 2025-10-04)
DOI: 10.1016/j.aei.2025.103919
URL: https://www.sciencedirect.com/science/article/pii/S1474034625008122
Impact factor: 9.9; JCR: Q1 (Computer Science, Artificial Intelligence)
Automatic perception of potential safety hazards (PSHs) is critical for ensuring workplace safety and protecting property against significant threats. PSH perception involves determining whether hazards exist, capturing on-site images, and completing inspection reports, all of which are critical for mitigating these risks. Computer vision techniques such as image classification and image captioning offer promising alternatives for PSH perception. However, comprehensive hazard perception requires not only hazard identification but also comprehension of the semantic relationships among scene entities in order to formulate descriptive safety reports. To address the multifaceted nature of PSH perception, this study proposes a cross-modal multi-task learning (MTL) method named Hazard-MTL, which jointly optimizes three synergistic tasks: feature alignment (image–text), binary image classification, and image captioning. Specifically, our approach employs a scene graph-guided chain-of-thought data augmentation method that integrates knowledge prompts and multi-task contextual reasoning to produce semantically coherent and informationally complete risk descriptions. To improve model robustness, a bidirectional contrastive loss is designed to suppress irrelevant cross-modal similarities. Additionally, a dynamic joint training strategy is introduced that combines progressive teacher forcing with adaptive loss weighting to achieve harmonized multi-task optimization. Our model outperforms single-task baselines, achieving a 72.7% F1 score (+10.7%) for PSH classification and 1.575 CIDEr (+0.619) for description generation. Hazard-MTL advances holistic scene understanding by integrating MTL, offering a safer automated solution for enterprise and construction safety management.
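The abstract does not give the form of the bidirectional contrastive loss, so the following is only a minimal sketch of one common formulation: a symmetric InfoNCE-style loss averaged over the image-to-text and text-to-image directions. The function names, the cosine similarity measure, and the temperature value of 0.07 are all assumptions made for illustration, not details taken from Hazard-MTL.

```python
import math


def cosine_similarity_matrix(image_feats, text_feats):
    """Pairwise cosine similarities; entry [i][j] compares image i with text j.

    Assumes every feature vector is non-zero.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    images = [normalize(v) for v in image_feats]
    texts = [normalize(v) for v in text_feats]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts]
            for img in images]


def row_nll(sim, temperature):
    """Mean negative log-probability of the diagonal (matching) entry per row."""
    total = 0.0
    for i, row in enumerate(sim):
        scaled = [s / temperature for s in row]
        peak = max(scaled)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scaled))
        total += log_z - scaled[i]
    return total / len(sim)


def bidirectional_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Average of image-to-text and text-to-image contrastive terms.

    Hypothetical illustration: pair i in image_feats matches pair i in
    text_feats, and all other pairs in the batch act as negatives.
    """
    sim = cosine_similarity_matrix(image_feats, text_feats)
    sim_transposed = [list(col) for col in zip(*sim)]
    image_to_text = row_nll(sim, temperature)
    text_to_image = row_nll(sim_transposed, temperature)
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(bidirectional_contrastive_loss(images, aligned_texts))
    print(bidirectional_contrastive_loss(images, shuffled_texts))
```

In this sketch, correctly paired image and text features give a loss near zero, while mismatched pairs are penalized in both directions. That is one way of discouraging high similarity between unrelated image-text pairs.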
About the journal:
Advanced Engineering Informatics is an international Journal that solicits research papers with an emphasis on 'knowledge' and 'engineering applications'. The Journal seeks original papers that report progress in applying methods of engineering informatics. These papers should have engineering relevance and help provide a scientific base for more reliable, spontaneous, and creative engineering decision-making. Additionally, papers should demonstrate the science of supporting knowledge-intensive engineering tasks and validate the generality, power, and scalability of new methods through rigorous evaluation, preferably both qualitatively and quantitatively. Abstracting and indexing for Advanced Engineering Informatics include Science Citation Index Expanded, Scopus and INSPEC.