Lulu Xu, Zhengyang Du, Meifeng Cai, Shangxian Yin, Shuning Dong, Hung Vo Thanh, Kenneth C. Carroll, Mohamad Reza Soltanian, Zhenxue Dai
Sparse data-driven knowledge discovery for interpretable prediction of permeability in tight sandstones
Engineering Geology, Volume 353, Article 108151. Published 2025-05-23. DOI: 10.1016/j.enggeo.2025.108151
https://www.sciencedirect.com/science/article/pii/S0013795225002479
Abstract
Permeability (k) is crucial for subsurface fluid flow, but predicting k-values in tight sandstones remains challenging due to their complex pore structure and heterogeneity. Although machine learning (ML) has shown promise, it faces significant challenges, including limited high-quality data, high computational costs, and unclear prediction mechanisms. This study proposes a sparse data-driven knowledge discovery framework aimed at enhancing the accuracy and interpretability of k-value predictions in tight sandstone formations. We integrate ML models with data augmentation (ML-DA), using Extreme Gradient Boosting (XGBoost-DA) and Least Squares Support Vector Regression (LSSVR-DA), optimized through genetic algorithms (GA), particle swarm optimization (PSO), and Bayesian optimization (BO). SHapley Additive Explanations (SHAP) are employed to elucidate the interactions between key factors influencing predictions. Monte Carlo simulations demonstrate the robust performance of our ML-DA models, even under data constraints. SHAP analysis identifies key predictors, including porosity, displacement pressure, median pore throat radius, median pressure, and carbonate content. Partial dependence plots (PDPs) reveal a significant interaction between porosity and carbonate content, as well as a decrease in model stability at low carbonate content. This study presents an interpretable ML framework with data augmentation, enabling improved predictions from sparse data while exploring the interactions between key factors. The framework can be adapted to other domains facing similar challenges, enhancing the accuracy and transparency of model predictions.
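The attribution idea behind SHAP can be illustrated with exact Shapley values on a toy response. The sketch below is not the authors' pipeline and does not use the SHAP library: the function `toy_log_permeability`, its coefficients, and the sample and baseline values are all hypothetical, chosen only so that porosity and carbonate content interact. Features left out of a coalition are set to a single baseline point, which is a simplifying assumption; practical SHAP estimators average over a background dataset instead.

```python
from itertools import combinations
from math import factorial


def toy_log_permeability(porosity, carbonate):
    """Hypothetical response with a porosity-carbonate interaction term.

    The coefficients are invented for illustration and are not fitted to data.
    """
    return 0.30 * porosity - 0.10 * carbonate - 0.02 * porosity * carbonate


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to `baseline`.

    Features outside a coalition take their baseline value. The cost grows
    exponentially with the number of features, so this suits only tiny examples.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        # Evaluate the model with coalition members at their observed values.
        args = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(**args)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {name}) - value(members))
        phi[name] = total
    return phi


# Hypothetical sample and reference point (porosity and carbonate in percent).
sample = {"porosity": 10.0, "carbonate": 5.0}
baseline = {"porosity": 6.0, "carbonate": 2.0}

phi = shapley_values(toy_log_permeability, sample, baseline)
for feature, contribution in phi.items():
    print(f"{feature}: {contribution:+.3f}")
```

The contributions sum to the difference between the prediction at the sample and the prediction at the baseline, and the interaction term is split evenly between the two features. That even split is why a separate tool such as a partial dependence plot is useful for examining the interaction itself.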
About the journal:
Engineering Geology, an international interdisciplinary journal, serves as a bridge between earth sciences and engineering, focusing on geological and geotechnical engineering. It welcomes studies with relevance to engineering, environmental concerns, and safety, catering to engineering geologists with backgrounds in geology or civil/mining engineering. Topics include applied geomorphology, structural geology, geophysics, geochemistry, environmental geology, hydrogeology, land use planning, natural hazards, remote sensing, soil and rock mechanics, and applied geotechnical engineering. The journal provides a platform for research at the intersection of geology and engineering disciplines.