Advanced EOR screening methodology based on LightGBM and random forest: A classification problem with imbalanced data

IF 1.6, Zone 4 (Engineering and Technology), JCR Q3 (ENGINEERING, CHEMICAL)

Canadian Journal of Chemical Engineering, Pub Date: 2024-08-08, DOI: 10.1002/cjce.25433

Masoud Seyyedattar, Majid Afshar, Sohrab Zendehboudi, Stephen Butt

Volume 103, Issue 2, pp. 846-867. Journal Article. https://onlinelibrary.wiley.com/doi/10.1002/cjce.25433

Citations: 0

Abstract

In an unstable oil market with volatile prices due to various natural and geopolitical factors, it is crucial for oil-producing companies to enhance the value of their assets by improving the recovery factors of petroleum reservoirs. Primary recovery through natural depletion or artificial lift and secondary recovery using waterflooding and immiscible gas injection typically recover no more than 10%–40% of the available reserves. A significant portion of the hydrocarbons remains unproduced if enhanced oil recovery (EOR) methods are not implemented. EOR projects are extremely costly, complex, and usually have long lead times from the decision-making and design phases to pilot and full-field implementations. Therefore, oil and gas operator companies need reliable insights into the best possible EOR options from the early stages of any field development planning. Since screening potential EOR choices is the first step in deciding future production scenarios, a smart EOR screening tool can add significant value by streamlining the EOR decision-making process. In this study, we developed an EOR screening tool based on two advanced machine learning classification algorithms, random forest and light gradient boosting machine (LightGBM). These tree-based ensemble learning classifiers were trained on an extensive dataset of 1384 worldwide EOR implementations, with various reservoir conditions and reservoir rock and fluid properties as the feature space and the EOR type as the class label. Because EOR screening is treated as a classification problem, an essential aspect of model development is addressing the data imbalance of EOR datasets. To tackle this issue, the adaptive synthetic (ADASYN) sampling method was used to reduce classification bias by oversampling the training sets to achieve uniform class distributions.
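The ADASYN step described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `adasyn_binary`, its parameters (`k`, `beta`, `seed`), and the toy data are assumptions made for illustration, and the sketch handles only one minority class against one majority class, whereas the study balances several EOR classes. It also assumes that interpolation partners are drawn from the nearest minority neighbours, which is one common reading of the method.

```python
import numpy as np


def adasyn_binary(X, y, minority_label, k=5, beta=1.0, seed=0):
    """Return synthetic minority samples generated ADASYN-style (binary case)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    is_min = y == minority_label
    X_min = X[is_min]
    n_min, n_maj = len(X_min), int((~is_min).sum())

    # Number of synthetic samples to create; beta=1.0 targets a balanced set.
    n_new = int(round((n_maj - n_min) * beta))
    if n_new <= 0 or n_min < 2:
        return np.empty((0, X.shape[1]))

    # Step 1: for each minority point, the share of majority samples among
    # its k nearest neighbours in the full dataset (the point itself excluded).
    d_all = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    d_all[np.arange(n_min), np.flatnonzero(is_min)] = np.inf
    nn_all = np.argsort(d_all, axis=1)[:, :k]
    r = (~is_min)[nn_all].mean(axis=1)
    if r.sum() == 0:
        # No minority point has majority neighbours: fall back to equal weights.
        r = np.ones(n_min)

    # Step 2: normalise the ratios and allocate more samples to harder points.
    g = np.rint(r / r.sum() * n_new).astype(int)

    # Step 3: interpolate between each minority point and a randomly chosen
    # one of its nearest minority neighbours (an assumption of this sketch).
    d_min = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d_min, np.inf)
    nn_min = np.argsort(d_min, axis=1)[:, : min(k, n_min - 1)]

    base = np.repeat(np.arange(n_min), g)
    partner = nn_min[base, rng.integers(0, nn_min.shape[1], size=len(base))]
    lam = rng.random((len(base), 1))
    return X_min[base] + lam * (X_min[partner] - X_min[base])


# Toy data (hypothetical): 90 majority samples and 10 minority samples.
data_rng = np.random.default_rng(42)
X_toy = np.vstack([data_rng.normal(0.0, 1.0, (90, 2)),
                   data_rng.normal(1.5, 1.0, (10, 2))])
y_toy = np.array([0] * 90 + [1] * 10)
X_synthetic = adasyn_binary(X_toy, y_toy, minority_label=1)
```

Because the per-point counts are rounded, the number of synthetic samples can differ slightly from the exact balancing target. In practice a maintained implementation such as `imblearn.over_sampling.ADASYN` would likely be used, applied to the training split only so that the test set keeps its original class distribution.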
We designed an iterative model development procedure in which the classifiers were trained and tested on various training and test subsets split by stratified random sampling. For each classifier, the classification results at each iteration were used to build the confusion matrix and calculate model evaluation metrics (accuracy, precision, recall, and F1-score), which were then averaged over all independent runs to provide a fair assessment of classification performance. Moreover, binary receiver operating characteristic (ROC) curves were used to evaluate the classifier predictions and the improvements obtained by oversampling. The results showed that both random forest and LightGBM classifiers made accurate class predictions, with LightGBM achieving slightly better classification performance in each modelling scenario (with or without oversampling). In both cases, oversampling the training dataset significantly improved the classifiers, as evidenced by higher values of the evaluation metrics, leading to considerably more accurate EOR type predictions; specifically, oversampling boosted the prediction accuracy of the random forest model from 78.3% to 89.5% and that of the LightGBM model from 77.5% to 90.2%. Additionally, feature importance rankings provided valuable insights into which input variables had the greatest impact on model development.
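The evaluation procedure (a confusion matrix per run, metrics derived from it, then an average over independent runs) can be sketched as follows. The abstract does not say how per-class precision, recall, and F1 were combined, so macro averaging is an assumption here, and the helper names `confusion_matrix`, `macro_metrics`, and `average_over_runs` as well as the toy labels are hypothetical.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


def macro_metrics(cm):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    tp = np.diag(cm).astype(float)
    pred_totals = cm.sum(axis=0)  # column sums: samples predicted as each class
    true_totals = cm.sum(axis=1)  # row sums: samples truly in each class
    # Classes with no predictions (or no samples) score 0 instead of NaN.
    precision = np.divide(tp, pred_totals, out=np.zeros_like(tp),
                          where=pred_totals > 0)
    recall = np.divide(tp, true_totals, out=np.zeros_like(tp),
                       where=true_totals > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp),
                   where=denom > 0)
    return {
        "accuracy": tp.sum() / cm.sum(),
        "precision": precision.mean(),
        "recall": recall.mean(),
        "f1": f1.mean(),
    }


def average_over_runs(runs, n_classes):
    """runs: list of (y_true, y_pred) pairs, one per independent split."""
    per_run = [macro_metrics(confusion_matrix(t, p, n_classes))
               for t, p in runs]
    return {name: float(np.mean([m[name] for m in per_run]))
            for name in per_run[0]}


# Two hypothetical runs on a three-class problem (labels 0, 1, 2).
example_runs = [
    ([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2]),  # one class-1 sample missed
    ([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2]),  # perfect predictions
]
averaged = average_over_runs(example_runs, n_classes=3)
```

Averaging over several stratified splits reduces the dependence of the reported scores on any single train/test partition, which matters for rare classes where a handful of test samples can shift recall noticeably.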


Source journal: Canadian Journal of Chemical Engineering (Engineering and Technology, Engineering: Chemical)

CiteScore: 3.60

Self-citation rate: 14.30%

Number of publications: 448

Review time: 3.2 months

Journal description: The Canadian Journal of Chemical Engineering (CJChE) publishes original research articles, new theoretical interpretations or experimental findings, and critical reviews in the science or industrial practice of chemical and biochemical processes. Preference is given to papers having a clearly indicated scope and applicability in any of the following areas: fluid mechanics, heat and mass transfer, multiphase flows, separations processes, thermodynamics, process systems engineering, reactors and reaction kinetics, catalysis, interfacial phenomena, electrochemical phenomena, bioengineering, minerals processing, natural products, and environmental and energy engineering. Papers that merely describe or present a conventional or routine analysis of existing processes will not be considered.