Methodology for generating diverse geotechnical datasets using Monte Carlo simulation and genetic algorithms

IF 9.1 · Region 1 Engineering Technology · Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Computer-Aided Civil and Infrastructure Engineering Pub Date : 2025-10-15 DOI:10.1111/mice.70106

Junghee Park, Hyung‐Koo Yoon


Citations: 0

Abstract

The reliability of machine learning depends heavily on training data; in geotechnical engineering, however, diverse datasets are difficult to obtain because of economic and accessibility limitations. This study proposes a method for generating training data for machine learning by combining Monte Carlo simulation and genetic algorithms. The original data sample is constructed on a 1 × 1 m grid for a slope, based on geotechnical properties measured in 23 regions: soil cohesion, slope angle, soil density, soil depth, and friction angle. From the original sample, predictions are made at an additional 1777 grid locations to estimate the spatial distribution of geotechnical properties across the entire slope. When a single variable is used as input, the log‐likelihood values (e.g., –5.4 to –144.5) are used only as relative indicators, not as absolute measures. The results are also compared with those generated by existing algorithms such as the synthetic minority oversampling technique (SMOTE) and adaptive synthetic sampling (ADASYN). The data generated by the proposed method exhibit fewer duplicate values, broader distribution ranges, and greater diversity. To ensure that the generated data align closely with the statistical characteristics of the actual data, the combination of input variables is configured to maximize the log‐likelihood value. To this end, Pearson correlation values are referenced, and multivariate inputs are constructed from highly correlated factors. With this approach, the log‐likelihood value increased by 21% to 96%. The study demonstrates that combining Monte Carlo simulation and genetic algorithms generates data with more diverse distributions than existing methods. It also shows that constructing multivariable input data is preferable for improving reliability.
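The evaluation steps named in the abstract (Pearson correlation to pick variables, log-likelihood as a relative score, and duplicate/range checks for diversity) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the Gaussian likelihood model, the 0.7 correlation threshold, and all function names are assumptions introduced here, and the Monte Carlo and genetic-algorithm generation step itself is not reproduced because the abstract does not describe it in detail.

```python
import numpy as np


def correlated_pairs(data, names, threshold=0.7):
    """Return (name_i, name_j, r) for column pairs whose |Pearson r| >= threshold.

    `threshold` is a hypothetical cut-off; the abstract only says that
    "highly correlated factors" are grouped into multivariate inputs.
    """
    r = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs


def gaussian_log_likelihood(reference, generated):
    """Log-likelihood of `generated` under a Gaussian fitted to `reference`.

    Assumption: a (multivariate) normal model with sample mean and sample
    covariance. Accepts 1-D arrays (single variable) or 2-D arrays
    (rows = samples, columns = variables).
    """
    reference = np.asarray(reference, dtype=float)
    generated = np.asarray(generated, dtype=float)
    if reference.ndim == 1:  # single-variable input
        reference = reference[:, None]
        generated = generated[:, None]
    mean = reference.mean(axis=0)
    cov = np.atleast_2d(np.cov(reference, rowvar=False))
    d = mean.size
    diff = generated - mean
    _, logdet = np.linalg.slogdet(cov)
    # Squared Mahalanobis distance of each generated row from the mean.
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    return float(np.sum(-0.5 * (d * np.log(2 * np.pi) + logdet + maha)))


def diversity_summary(column):
    """Count repeated values and measure the spread of one generated variable."""
    column = np.asarray(column, dtype=float)
    return {
        "duplicates": int(column.size - np.unique(column).size),
        "range": float(column.max() - column.min()),
    }
```

In use, one would call `correlated_pairs` on the measured sample to choose variable groups, then compare `gaussian_log_likelihood` across candidate generated datasets for the same group. As the abstract notes, such values are meaningful only relative to one another, not as absolute measures.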


Source journal: Computer-Aided Civil and Infrastructure Engineering (Engineering Technology - Engineering: Civil)

CiteScore: 17.60

Self-citation rate: 19.80%

Articles published: 146

Review time: 1 month

Journal description: Computer-Aided Civil and Infrastructure Engineering stands as a scholarly, peer-reviewed archival journal, serving as a vital link between advancements in computer technology and civil and infrastructure engineering. The journal serves as a distinctive platform for the publication of original articles, spotlighting novel computational techniques and inventive applications of computers. Specifically, it concentrates on recent progress in computer and information technologies, fostering the development and application of emerging computing paradigms. Encompassing a broad scope, the journal addresses bridge, construction, environmental, highway, geotechnical, structural, transportation, and water resources engineering. It extends its reach to the management of infrastructure systems, covering domains such as highways, bridges, pavements, airports, and utilities. The journal delves into areas like artificial intelligence, cognitive modeling, concurrent engineering, database management, distributed computing, evolutionary computing, fuzzy logic, genetic algorithms, geometric modeling, internet-based technologies, knowledge discovery and engineering, machine learning, mobile computing, multimedia technologies, networking, neural network computing, optimization and search, parallel processing, robotics, smart structures, software engineering, virtual reality, and visualization techniques.