A quality index for construction big data in shield tunneling
Authors: Chao Zhang, Yuhao Ren, Qihang Huang, Renpeng Chen
Journal: Engineering Applications of Artificial Intelligence, volume 156, Article 111023 (impact factor 7.5; JCR Q1, Automation & Control Systems)
Published: 2025-05-21
DOI: 10.1016/j.engappai.2025.111023
URL: https://www.sciencedirect.com/science/article/pii/S0952197625010231
The quality of the dataset underpinning data-driven models sets the upper limit on their performance, yet no quantitative measure of it exists for the construction big data generated in earth pressure balance (EPB) shield tunneling. Herein, a quality index is proposed to fill this gap, formulated as the L2 norm of a vector of three components: accuracy, inclusiveness, and informativeness. The accuracy component is the ratio of non-outlier samples, so a dataset containing fewer outliers has a higher accuracy; it reflects the extent to which the dataset represents the real construction conditions during tunneling. The inclusiveness component is the normalized envelope area of the dataset after it is mapped into a two-dimensional space; it reflects the range of construction scenarios included in the dataset. The informativeness component is the dimensionless reduction in the uncertainty of a given data-driven model that the dataset produces; it reflects the contribution of the dataset to that model's predictions. The proposed quality index is assessed using a big database collected from multiple tunneling projects. A series of sub-datasets deliberately divided from this database are used to train data-driven models with three commonly used algorithms (random forest, neural network, and K-nearest neighbors) for three target functions of wide concern in tunneling (torque, thrust, and penetration). The quality index of the training data correlates strongly and consistently with the performance of the data-driven models (R-values > 0.91), regardless of algorithm, target function, and sample size. The proposed quality index serves as a theoretical basis for a series of practical application scenarios, e.g., training data selection and core dataset development. A practical application based on the Changsha project illustrated that a training dataset selected using the quality index can boost the performance of the developed data-driven models by more than 38% and reduce training time by more than 26%.
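The structure of the index described in the abstract can be illustrated with a short Python sketch. It is not the authors' implementation: the abstract does not state how outliers are detected, how the data are mapped into two dimensions, or how the envelope area is normalized. The sketch therefore assumes an interquartile-range fence for outliers, a projection onto the first two principal components of a reference database, and a convex hull as the envelope. The informativeness component depends on a trained model, so it is passed in as a number rather than computed. All function names are hypothetical.

```python
import numpy as np


def accuracy(X, k=1.5):
    """Share of samples with every feature inside the IQR fences (assumed outlier rule)."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    inside = (X >= q1 - k * iqr) & (X <= q3 + k * iqr)
    return float(np.mean(inside.all(axis=1)))


def hull_area(P):
    """Area of the convex hull of 2-D points (monotone chain, then shoelace formula)."""
    pts = sorted(set(map(tuple, np.asarray(P, dtype=float))))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(points):
        chain = []
        for p in points:
            # Drop points that would make a clockwise or collinear turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    hull = np.array(half(pts) + half(pts[::-1]))
    x, y = hull[:, 0], hull[:, 1]
    return float(0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1))))


def inclusiveness(X_sub, X_ref):
    """Hull area of the sub-dataset relative to the reference database.

    Both are projected onto the first two principal components of X_ref
    (an assumed 2-D mapping; the paper's mapping may differ).
    """
    mean = X_ref.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_ref - mean, full_matrices=False)

    def project(X):
        return (X - mean) @ Vt[:2].T

    ref_area = hull_area(project(X_ref))
    return hull_area(project(X_sub)) / ref_area if ref_area > 0 else 0.0


def quality_index(acc, inc, info):
    """L2 norm of the (accuracy, inclusiveness, informativeness) vector."""
    return float(np.linalg.norm([acc, inc, info]))
```

With each component scaled to the range 0 to 1, this unnormalized norm lies between 0 and the square root of 3. Whether the paper rescales the result is not stated in the abstract.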
About the journal:
Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution, witnessing remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research elucidating the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI utilized in real-world engineering applications, validated using publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.