Application of SVC, k-NN, and LDA machine learning algorithms for improved prediction of Bioturbation: Example from the Subei Basin, China

IF 3.0 | Zone 4, Earth Science | Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Earth Science Informatics | Pub Date: 2024-08-28 | DOI: 10.1007/s12145-024-01450-z

Jonathan Atuquaye Quaye, Kwame Sarkodie, Zaixing Jiang, Chenlin Hu, Joshua Agbanu, Stephen Adjei, Baiqiang Li


Citations: 0

Abstract

Three supervised machine learning (ML) classification algorithms, Support Vector Classifier (SVC), k-Nearest Neighbour (k-NN), and Linear Discriminant Analysis (LDA), are applied to seventy-six (76) data points from nine (9) core sample datasets retrieved from five (5) selected wells in oilfields of the Subei Basin to delineate bioturbation. Feature selection via p-score and F-score reduced the number of relevant features to 7 of the 12 considered. Each classifier was trained and tested with 80% of the data allocated for training and the remaining 20% for testing. During model training, the hyperparameters of the SVC (C, gamma, and kernel) and k-NN (k value) were optimized via grid search to identify the form of the decision boundaries that gives the best accuracy in predicting bioturbation. The optimized SVC used a linear kernel, C = 1000, and gamma = 0.10, giving a training accuracy of 96.17%. The optimized k-NN classifier used k = 5 nearest neighbours, giving a training accuracy of 73.28%. The LDA classifier had a training accuracy of 67.36%, making it the worst-performing classifier in this work. Fivefold stratified cross-validation was then performed on each classifier to assess model generalization and stability for the prediction of unseen test data. On the test data, the SVC was the best predictor of the bioturbation index at 92.86% accuracy, followed by the k-NN model at 90.48% and the LDA classifier, which gave the lowest test accuracy at 76.2%. These results indicate that bioturbation can be predicted via ML methods, which offer a more efficient and effective means of rock characterization than the conventional methods used in the oil and gas industry.
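The workflow summarized in the abstract (feature selection, an 80/20 split, grid search, and fivefold stratified cross-validation) can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' code: the data are synthetic stand-ins generated with `make_classification`, the three-class target, the feature scaling step, the use of `f_classif` as the F-score, the grid values, and the helper name `build_pipeline` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the core data (assumption): 76 samples, 12 features,
# and a hypothetical 3-class bioturbation label.
X, y = make_classification(
    n_samples=76, n_features=12, n_informative=7, n_redundant=2,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# 80% of the data for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def build_pipeline(classifier):
    """Scale features, keep the 7 best by ANOVA F-score, then classify."""
    return Pipeline([
        ("scale", StandardScaler()),  # scaling is an assumption, not from the paper
        ("select", SelectKBest(score_func=f_classif, k=7)),
        ("clf", classifier),
    ])


# Hyperparameter grids (values are illustrative assumptions).
searches = {
    "SVC": (SVC(), {
        "clf__kernel": ["linear", "rbf"],
        "clf__C": [1, 10, 100, 1000],
        "clf__gamma": [0.01, 0.1, 1],
    }),
    "k-NN": (KNeighborsClassifier(), {"clf__n_neighbors": [3, 5, 7, 9]}),
    # LDA is not tuned here; the single-entry grid only routes it through
    # the same cross-validation loop as the other two classifiers.
    "LDA": (LinearDiscriminantAnalysis(), {"clf__solver": ["svd"]}),
}

# Fivefold stratified cross-validation on the training portion.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, (classifier, grid) in searches.items():
    search = GridSearchCV(build_pipeline(classifier), grid, cv=cv, scoring="accuracy")
    search.fit(X_train, y_train)
    results[name] = {
        "best_params": search.best_params_,
        "cv_accuracy": search.best_score_,
        "test_accuracy": search.score(X_test, y_test),
        "n_selected": int(search.best_estimator_.named_steps["select"].get_support().sum()),
    }

for name, summary in results.items():
    print(name, summary)
```

Placing the feature selector inside the pipeline means the 7 features are re-selected within each cross-validation fold, so the held-out fold never influences which features are kept. The accuracies printed by this sketch come from synthetic data and are not comparable to the figures reported in the abstract.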




Source journal

Earth Science Informatics (COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS; GEOSCIENCES, MULTIDISCIPLINARY)

CiteScore: 4.60

Self-citation rate: 3.60%

Articles published: 157

Review time: 4.3 months

About the journal: The Earth Science Informatics [ESIN] journal aims at rapid publication of high-quality, current, cutting-edge, and provocative scientific work in the area of Earth Science Informatics as it relates to Earth systems science and space science. This includes articles on the application of formal and computational methods, computational Earth science, spatial and temporal analyses, and all aspects of computer applications to the acquisition, storage, processing, interchange, and visualization of data and information about the materials, properties, processes, features, and phenomena that occur at all scales and locations in the Earth system’s five components (atmosphere, hydrosphere, geosphere, biosphere, cryosphere) and in space (see "About this journal" for more detail). The quarterly journal publishes research, methodology, and software articles, as well as editorials, comments, and book and software reviews. Review articles of relevant findings, topics, and methodologies are also considered.