An efficient machine learning-based model for duration prediction of construction tasks with large-scale datasets

IF 9.9 · Zone 1 (Engineering Technology) · JCR Q1 (Computer Science, Artificial Intelligence)

Advanced Engineering Informatics · Pub Date: 2025-09-10 · DOI: 10.1016/j.aei.2025.103820

Yaping Liu, Huan Luo

Volume 69, Article 103820. Full text: https://www.sciencedirect.com/science/article/pii/S147403462500713X

Citations: 0

Abstract

An efficient machine learning-based model for duration prediction of construction tasks with large-scale datasets

Infrastructure project delays increasingly cause substantial economic losses and operational risks, yet conventional methods for construction schedule risk analysis remain reliant on subjective empirical judgments. While machine learning (ML) methods have mitigated some limitations, prevailing approaches focus on macro-level predictions using small-sample datasets and largely neglect textual data in construction tasks. To address these issues, this study proposes KLWLS-SVMR, a novel ML model that integrates textual and numerical features to predict construction task durations. The proposed model quantifies unstructured task descriptions through topic modeling, constructs optimal feature sets via a PCA-random forest hybrid mechanism, and integrates k-means clustering with locally weighted support vector regression to enhance prediction accuracy and computational efficiency for large-scale datasets. The superior performance of the proposed method is demonstrated by comparison with 11 widely used ML methods on a dataset covering 140,378 real-world tasks. Compared to the best-performing benchmarks, Extremely Randomized Trees ($R^{2} = 0.968$) and Adaptive Boosting ($R^{2} = 0.960$), the proposed model achieves a higher $R^{2}$ value of 0.973 while reducing computational time by 25.0% and 79.7%, respectively. Compared to training on the original feature set, KLWLS-SVMR trained with the optimal feature sets formulated by the proposed hybrid mechanism shows an 11.3% increase in $R^{2}$ and decreases of 53.5% and 51.8% in RMSE and MAE, respectively, while improving computational efficiency by 40.0%. Hypothesis testing confirms that all ML models trained with the optimal feature sets show a statistically significant improvement in prediction performance ($p = 0.006 \ll 0.05$). This work advances ML applications in construction engineering by providing a practical technical pathway for optimizing task-level resource scheduling and risk management.
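
The abstract names the building blocks (k-means clustering, a locally weighted regressor, and the scores R², RMSE and MAE) but gives no formulation. The sketch below is one plausible reading of the cluster-then-local-model idea, not the authors' KLWLS-SVMR: a Gaussian-weighted least-squares line is swapped in for the support vector regressor, the feature is a single scalar, and every function name, the bandwidth, and the toy data are assumptions made for illustration.

```python
import math


def nearest(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))


def kmeans_1d(xs, k, iters=20):
    """Lloyd's k-means on scalars; centroids start evenly spread over sorted xs."""
    s = sorted(xs)
    centroids = [s[i * (len(s) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest(x, centroids)].append(x)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
    return centroids


def fit_local_line(xs, ys, center, bandwidth):
    """Weighted least squares for y = a + b*x with Gaussian weights around center.

    Stand-in for the paper's local regressor (assumption). Needs at least two
    distinct x values, otherwise the denominator below is zero.
    """
    w = [math.exp(-((x - center) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    sw = sum(w)
    sx = sum(wi * x for wi, x in zip(w, xs))
    sy = sum(wi * y for wi, y in zip(w, ys))
    sxx = sum(wi * x * x for wi, x in zip(w, xs))
    sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    b = (sw * sxy - sx * sy) / (sw * sxx - sx * sx)
    a = (sy - b * sx) / sw
    return a, b


def fit(xs, ys, k=2, bandwidth=2.0):
    """Cluster the inputs, then fit one local model per cluster on its own points."""
    centroids = kmeans_1d(xs, k)
    models = []
    for j, c in enumerate(centroids):
        idx = [i for i, x in enumerate(xs) if nearest(x, centroids) == j]
        models.append(
            fit_local_line([xs[i] for i in idx], [ys[i] for i in idx], c, bandwidth)
        )
    return centroids, models


def predict(x, centroids, models):
    """Route x to its nearest cluster and evaluate that cluster's line."""
    a, b = models[nearest(x, centroids)]
    return a + b * x


def metrics(y_true, y_pred):
    """Return (R^2, RMSE, MAE), the three scores quoted in the abstract."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / n), mae


if __name__ == "__main__":
    # Toy data with two regimes (hypothetical, not from the paper).
    xs = [i * 0.5 for i in range(9)] + [10 + i * 0.5 for i in range(9)]
    ys = [2 * x + 1 if x < 5 else 30 - x for x in xs]
    centroids, models = fit(xs, ys, k=2)
    preds = [predict(x, centroids, models) for x in xs]
    print(metrics(ys, preds))
```

Because each local model only sees the points in its own cluster, training cost depends on cluster size rather than on the full dataset, which is one way such a design could reduce computation on large datasets. Whether the paper's efficiency gains arise this way is not stated in the abstract.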

Source journal

Advanced Engineering Informatics (Engineering Technology: Engineering, Multidisciplinary)

CiteScore: 12.40
Self-citation rate: 18.20%
Articles published: 292
Review time: 45 days

About the journal: Advanced Engineering Informatics is an international journal that solicits research papers with an emphasis on 'knowledge' and 'engineering applications'. The journal seeks original papers that report progress in applying methods of engineering informatics. These papers should have engineering relevance and help provide a scientific base for more reliable, spontaneous, and creative engineering decision-making. Additionally, papers should demonstrate the science of supporting knowledge-intensive engineering tasks and validate the generality, power, and scalability of new methods through rigorous evaluation, preferably both qualitative and quantitative. Abstracting and indexing for Advanced Engineering Informatics include Science Citation Index Expanded, Scopus and INSPEC.