Title: An efficient machine learning-based model for duration prediction of construction tasks with large-scale datasets
Authors: Yaping Liu, Huan Luo
Journal: Advanced Engineering Informatics, volume 69, Article 103820 (Impact Factor 9.9; JCR Q1, Computer Science, Artificial Intelligence)
Publication date: 2025-09-10
DOI: 10.1016/j.aei.2025.103820
URL: https://www.sciencedirect.com/science/article/pii/S147403462500713X
Citation count: 0
Abstract
Infrastructure project delays increasingly cause substantial economic losses and operational risks, yet conventional methods for construction schedule risk analysis remain reliant on subjective empirical judgments. While machine learning (ML) methods have mitigated some limitations, prevailing approaches focus on macro-level predictions using small-sample datasets, largely neglecting textual data in construction tasks. To address these issues, this study proposes KLWLS-SVMR, a novel ML model that integrates textual and numerical features to predict construction task durations. The proposed model quantifies unstructured task descriptions through topic modeling, constructs optimal feature sets via a PCA-random forest hybrid mechanism, and integrates k-means clustering with locally weighted support vector regression to enhance prediction accuracy and computational efficiency for large-scale datasets. The superior performance of the proposed method is demonstrated by comparison with 11 widely used ML methods on a dataset covering 140,378 real-world tasks. Compared to the best-performing benchmarks, Extremely Randomized Trees ($R^2 = 0.968$) and Adaptive Boosting ($R^2 = 0.960$), the proposed model achieves a higher $R^2$ value of 0.973 while reducing computational time by 25.0% and 79.7%, respectively. Compared to the original feature set, KLWLS-SVMR trained with the optimal feature sets formulated by the proposed hybrid mechanism shows an 11.3% increase in $R^2$ and decreases of 53.5% and 51.8% in RMSE and MAE, respectively, while improving computational efficiency by 40.0%. Rigorous hypothesis testing confirms that the improvement in prediction performance for all ML models trained with the optimal feature sets is statistically significant ($p = 0.006 \ll 0.05$). This work advances ML applications in construction engineering by providing a practical technical pathway for optimizing task-level resource scheduling and risk management.
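The abstract's idea of pairing k-means clustering with a local kernel regressor can be illustrated with a minimal sketch. This is not the authors' KLWLS-SVMR implementation. It assumes a least-squares SVM regressor with an RBF kernel fitted separately on each cluster, a deterministic farthest-first k-means initialisation, and toy one-dimensional data. All function names, the hyperparameters `gamma` and `sigma`, and the data values are hypothetical. Topic modeling and the PCA-random forest feature selection are omitted.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def kmeans(points, k, iters=20):
    # Farthest-first initialisation: an illustrative, deterministic choice.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the previous centroid if a cluster is empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def rbf(a, b, sigma):
    return math.exp(-sq_dist(a, b) / (2 * sigma ** 2))

def fit_lssvm(X, y, gamma, sigma):
    # LS-SVM regression: solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y].
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        A.append([1.0] + [rbf(X[i], X[j], sigma) + (1.0 / gamma if i == j else 0.0)
                          for j in range(n)])
    sol = solve(A, [0.0] + list(y))
    return sol[0], sol[1:]  # bias, dual coefficients

def fit_clustered(X, y, k, gamma=100.0, sigma=1.0):
    centroids = kmeans(X, k)
    parts = [([], []) for _ in range(k)]
    for xi, yi in zip(X, y):
        j = nearest(xi, centroids)
        parts[j][0].append(xi)
        parts[j][1].append(yi)
    # One small model per non-empty cluster instead of one model on all data.
    models = [(Xj, fit_lssvm(Xj, yj, gamma, sigma), sigma) if Xj else None
              for Xj, yj in parts]
    return centroids, models

def predict_clustered(model, x):
    centroids, models = model
    Xj, (b, alpha), sigma = models[nearest(x, centroids)]
    return b + sum(a * rbf(xi, x, sigma) for a, xi in zip(alpha, Xj))

# Toy data (hypothetical): one feature, two well-separated groups of tasks.
X = [[0.0], [0.5], [1.0], [1.5], [2.0], [10.0], [10.5], [11.0], [11.5], [12.0]]
y = [5.0, 5.5, 6.0, 6.5, 7.0, 30.0, 29.0, 28.0, 27.0, 26.0]
model = fit_clustered(X, y, k=2)
pred_a = predict_clustered(model, [1.0])
pred_b = predict_clustered(model, [11.0])
print(round(pred_a, 2), round(pred_b, 2))
```

In this sketch, each query is routed to its nearest centroid and answered by that cluster's model alone. Solving the dense system costs roughly cubic time in the number of training points, so fitting k small systems is much cheaper than fitting one large system. That is one plausible reading of the efficiency gain the abstract reports. The abstract does not specify the per-query weighting scheme used in the paper, and none is reproduced here.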
About the journal:
Advanced Engineering Informatics is an international Journal that solicits research papers with an emphasis on 'knowledge' and 'engineering applications'. The Journal seeks original papers that report progress in applying methods of engineering informatics. These papers should have engineering relevance and help provide a scientific base for more reliable, spontaneous, and creative engineering decision-making. Additionally, papers should demonstrate the science of supporting knowledge-intensive engineering tasks and validate the generality, power, and scalability of new methods through rigorous evaluation, preferably both qualitatively and quantitatively. Abstracting and indexing for Advanced Engineering Informatics include Science Citation Index Expanded, Scopus and INSPEC.