Do more with less: Exploring semi-supervised learning for geological image classification

IF 3.2 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Applied Computing and Geosciences. Pub Date: 2025-02-01. DOI: 10.1016/j.acags.2024.100216

Hisham I. Mamode, Gary J. Hampson, Cédric M. John

Volume 25, Article 100216. Full text: https://www.sciencedirect.com/science/article/pii/S2590197424000636


Abstract

Labeled datasets in geoscience are often small: data acquisition is costly and challenging, and data scarcity makes interpretation and downstream use in machine learning difficult. Deep learning algorithms require large datasets to learn a robust relationship between the data and its label and to avoid overfitting. To overcome the paucity of data, transfer learning has been employed in classification tasks. But an alternative exists: there is often a large corpus of unlabeled data which may enhance the learning process. To evaluate this potential for subsurface data, we compare a high-performance semi-supervised learning (SSL) algorithm (SimCLRv2) with supervised transfer learning on a Convolutional Neural Network (CNN) in geological image classification.
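As a rough illustration of the supervised transfer learning baseline, the sketch below trains only a small classification head on top of fixed image embeddings. It is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract does not say whether the backbone was frozen or fine-tuned, the embeddings here are synthetic stand-ins for the output of a pretrained CNN, and the names `train_linear_head` and `predict` are hypothetical.

```python
import numpy as np

def train_linear_head(features, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression head on fixed (frozen) backbone features."""
    n_samples, n_dims = features.shape
    weights = np.zeros(n_dims)
    bias = 0.0
    for _ in range(epochs):
        logits = features @ weights + bias
        probs = 1.0 / (1.0 + np.exp(-logits))
        error = probs - labels  # gradient of the cross-entropy w.r.t. the logits
        weights -= lr * features.T @ error / n_samples
        bias -= lr * error.mean()
    return weights, bias

def predict(features, weights, bias):
    """Return binary class predictions (1 where the logit is positive)."""
    return (features @ weights + bias > 0).astype(int)

# Synthetic stand-in for embeddings from a pretrained CNN (assumption:
# two classes, e.g. disturbed vs. undisturbed, centred at +1.5 and -1.5).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=40)
centers = np.where(labels[:, None] == 1, 1.5, -1.5)
features = rng.normal(size=(40, 8)) + centers

weights, bias = train_linear_head(features, labels)
accuracy = (predict(features, weights, bias) == labels).mean()
```

Because only the head is trained, a small labeled set is enough to fit it; the quality of the result then depends mostly on how well the pretrained features suit the target images.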

We tested the two approaches on a classification task of sediment disturbance from cores of International Ocean Drilling Program (IODP) Expeditions 383 and 385. Our results show that semi-supervised transfer learning can be an effective strategy to adopt, with SimCLRv2 capable of producing representations comparable to those of supervised transfer learning. However, attempts to enhance the performance of semi-supervised transfer learning with task-specific unlabeled images during self-supervision degraded the representations. Significantly, we demonstrate that SimCLRv2 trained on a dataset of core disturbance images can outperform supervised transfer learning of a CNN once a critical number of task-specific unlabeled images are available for self-supervision. The gain in performance compared to supervised transfer learning is 1% and 3% for binary and multi-class classification, respectively.
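The self-supervision step in SimCLR-family methods learns from unlabeled images by pulling together the embeddings of two augmented views of the same image and pushing apart the embeddings of different images, using the NT-Xent contrastive loss. The following is a minimal NumPy sketch of that loss, assuming the embeddings have already been computed; the function name `nt_xent_loss`, the temperature value, and the random test embeddings are illustrative assumptions, and the abstract gives no implementation details.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss for paired embeddings: row i of z_a and z_b are two views of image i."""
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit vectors, so dot product = cosine
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # a view is never a candidate match for itself
    # The positive for row i is the other view of the same image: i + n or i - n.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-sum-exp over all other views in the batch.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * n), positives] - log_norm
    return -log_prob.mean()

# Illustrative check with random embeddings (hypothetical data).
rng = np.random.default_rng(0)
z_a = rng.normal(size=(8, 16))
aligned_loss = nt_xent_loss(z_a, z_a + 0.01 * rng.normal(size=(8, 16)))
unrelated_loss = nt_xent_loss(z_a, rng.normal(size=(8, 16)))
```

When the two views of each image map to nearly the same embedding, the loss is lower than when the paired embeddings are unrelated, which is the signal the encoder is trained to exploit.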

Supervised transfer learning can be deployed with comparative ease, whereas current SSL algorithms such as SimCLRv2 require more effort. We recommend that SSL be explored in cases when large amounts of unlabeled task-specific images exist and an improvement of a few percent in metrics matters. When examining small, highly specialized datasets, without large amounts of unlabeled images, supervised transfer learning might be the best strategy to adopt. Overall, SSL is a promising approach and future work should explore this approach utilizing different dataset types, quantities, and qualities.
