Methodology for generating diverse geotechnical datasets using Monte Carlo simulation and genetic algorithms
Authors: Junghee Park, Hyung‐Koo Yoon
Journal: Computer-Aided Civil and Infrastructure Engineering (impact factor 9.1; JCR Q1, Computer Science, Interdisciplinary Applications)
Published: 2025-10-15
DOI: 10.1111/mice.70106 (https://doi.org/10.1111/mice.70106)
Citations: 0
Abstract
The reliability of machine learning heavily depends on training data; however, in the field of geotechnical engineering, it is challenging to obtain diverse datasets due to economic and accessibility limitations. The aim of this study is to propose a method for generating data for use in the training phase of machine learning by combining Monte Carlo simulations and genetic algorithms. The original data sample is constructed using a 1 × 1 m grid for a slope, based on geotechnical properties measured in 23 regions, including soil cohesion, slope angle, soil density, soil depth, and friction angle. Based on the original sample, further predictions are made at an additional 1777 grid locations to estimate the spatial distribution of geotechnical properties across the entire slope. When a single variable is used as input, the log‐likelihood values (e.g., –5.4 to –144.5) are used only as relative indicators, not as absolute measures. The results are also compared to those generated using existing algorithms such as the synthetic minority oversampling technique and adaptive synthetic sampling. The data generated using the proposed method exhibits fewer duplicate values, broader distribution ranges, and greater diversity. To ensure that the generated data closely aligns with the statistical characteristics of the actual data, the combination of input variables is configured to maximize the log‐likelihood value. To achieve this, Pearson correlation values are referenced, and multivariate input variables are constructed using highly correlated factors. As a result of this approach, the log‐likelihood value increased by 21% to 96%. This study demonstrates that the method combining Monte Carlo simulations and genetic algorithms generates data with more diverse distributions, compared to existing methods. It also highlights that constructing multivariable input data is preferable for improving reliability.
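The abstract names two quantities: Pearson correlation, used to choose which variables to combine, and log-likelihood, used as a relative indicator of how well generated data match the original. A minimal sketch of both follows. It is not the authors' implementation: the normal distribution, the 0.7 correlation threshold, the function names, and the measurement values are assumptions made for illustration, and the genetic-algorithm step is omitted.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def gaussian_log_likelihood(values, mean, std):
    """Sum of log densities of `values` under a normal(mean, std) model."""
    const = -0.5 * math.log(2 * math.pi) - math.log(std)
    return sum(const - 0.5 * ((v - mean) / std) ** 2 for v in values)

def monte_carlo_samples(observed, n, rng):
    """Draw n values from a normal distribution fitted to `observed`.

    The normal model is an assumption of this sketch, not taken from the paper.
    """
    mean = sum(observed) / len(observed)
    var = sum((v - mean) ** 2 for v in observed) / (len(observed) - 1)
    std = math.sqrt(var)
    return [rng.gauss(mean, std) for _ in range(n)], mean, std

def correlated_pairs(columns, threshold=0.7):
    """Return variable-name pairs whose |Pearson r| meets the threshold."""
    names = sorted(columns)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if abs(pearson(columns[a], columns[b])) >= threshold]

# Hypothetical measurements, invented for illustration only.
data = {
    "cohesion_kpa": [5.1, 6.3, 4.8, 7.2, 6.0, 5.5],
    "friction_angle_deg": [30.2, 32.5, 29.8, 34.1, 31.9, 31.0],
    "soil_depth_m": [1.2, 0.8, 1.5, 0.9, 1.1, 1.3],
}

rng = random.Random(0)  # fixed seed so the run is repeatable
generated, mean, std = monte_carlo_samples(data["cohesion_kpa"], 100, rng)
log_lik = gaussian_log_likelihood(generated, mean, std)
pairs = correlated_pairs(data)

print("log-likelihood of generated cohesion values:", round(log_lik, 2))
print("highly correlated variable pairs:", pairs)
```

Because the log-likelihood scales with the sample size and the units of each variable, it is only meaningful when comparing alternatives on the same data, which matches the abstract's description of it as a relative indicator.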
About the journal:
Computer-Aided Civil and Infrastructure Engineering stands as a scholarly, peer-reviewed archival journal, serving as a vital link between advancements in computer technology and civil and infrastructure engineering. The journal serves as a distinctive platform for the publication of original articles, spotlighting novel computational techniques and inventive applications of computers. Specifically, it concentrates on recent progress in computer and information technologies, fostering the development and application of emerging computing paradigms.
Encompassing a broad scope, the journal addresses bridge, construction, environmental, highway, geotechnical, structural, transportation, and water resources engineering. It extends its reach to the management of infrastructure systems, covering domains such as highways, bridges, pavements, airports, and utilities. The journal delves into areas like artificial intelligence, cognitive modeling, concurrent engineering, database management, distributed computing, evolutionary computing, fuzzy logic, genetic algorithms, geometric modeling, internet-based technologies, knowledge discovery and engineering, machine learning, mobile computing, multimedia technologies, networking, neural network computing, optimization and search, parallel processing, robotics, smart structures, software engineering, virtual reality, and visualization techniques.