Dai Quoc Tran, Armstrong Aboah, Yuntae Jeon, Minh-Truyen Do, Mohamed Abdel-Aty, Minsoo Park, Seunghee Park
Title: Visual Question Answering-based Referring Expression Segmentation for construction safety analysis
Journal: Automation in Construction, Volume 174, Article 106127 (Q1, Construction & Building Technology)
DOI: 10.1016/j.autcon.2025.106127
Published: 2025-03-26
URL: https://www.sciencedirect.com/science/article/pii/S0926580525001670
Citations: 0
Abstract
Despite advancements in computer vision techniques like object detection and segmentation, a significant gap remains in leveraging these technologies for hazard recognition through natural language processing. To address this gap, this paper proposes VQA-RESCon, an approach that combines Visual Question Answering (VQA) and Referring Expression Segmentation (RES) to enhance construction safety analysis. By leveraging the visual grounding capabilities of RES, the method not only identifies potential hazards through VQA but also precisely localizes and highlights these hazards within the image. The method utilizes a large "scenario-questions" dataset comprising 200,000 images and 16 targeted questions to train a vision-and-language transformer model. In addition, post-processing was performed using ClipSeg and the Segment Anything Model. The validation results indicate that both the VQA and RES models demonstrate notable reliability and precision: the VQA model achieves an F1 score surpassing 90%, while the segmentation models achieve a Mean Intersection over Union of 57%.
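The two-stage pipeline described in the abstract — a VQA model answering targeted safety questions, followed by referring expression segmentation of confirmed hazards — can be sketched as below. This is a minimal illustrative skeleton, not the authors' actual code: the question texts, function names, and stubbed model calls are all hypothetical stand-ins for the paper's vision-and-language transformer and its ClipSeg/Segment Anything post-processing.

```python
# Hypothetical sketch of the VQA-RESCon pipeline: (1) a VQA model answers
# targeted safety questions about a site image; (2) each positive answer is
# mapped to a referring expression that a RES model (ClipSeg + Segment
# Anything in the paper) grounds as a segmentation mask.
from dataclasses import dataclass
from typing import Optional

# The paper trains on 16 targeted questions; two illustrative examples shown.
SAFETY_QUESTIONS = [
    ("Is any worker not wearing a hard hat?", "worker without a hard hat"),
    ("Is any worker near the crane load?", "worker near the crane load"),
]

@dataclass
class HazardReport:
    question: str
    answer: str
    mask: Optional[str] = None  # filled only when the answer is "yes"

def vqa_answer(image, question: str) -> str:
    """Stub for the vision-and-language transformer's yes/no answer."""
    return "yes" if "hard hat" in question else "no"

def res_segment(image, expression: str) -> str:
    """Stub for ClipSeg/SAM-style referring expression segmentation."""
    return f"mask({expression})"

def analyze(image) -> list:
    """Run all safety questions, then ground only the confirmed hazards."""
    reports = []
    for question, expression in SAFETY_QUESTIONS:
        report = HazardReport(question, vqa_answer(image, question))
        if report.answer == "yes":
            report.mask = res_segment(image, expression)
        reports.append(report)
    return reports
```

The design point this sketch captures is the coupling: RES is invoked only for hazards the VQA stage confirms, so the segmentation models localize a specific referring expression rather than searching the image blindly.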
About the journal
Automation in Construction is an international journal that focuses on publishing original research papers related to the use of Information Technologies in various aspects of the construction industry. The journal covers topics such as design, engineering, construction technologies, and the maintenance and management of constructed facilities.
The scope of Automation in Construction is extensive and covers all stages of the construction life cycle. This includes initial planning and design, construction of the facility, operation and maintenance, as well as the eventual dismantling and recycling of buildings and engineering structures.