Do more with less: Exploring semi-supervised learning for geological image classification
Hisham I. Mamode, Gary J. Hampson, Cédric M. John
Applied Computing and Geosciences, volume 25, Article 100216. Published 2025-02-01. DOI: 10.1016/j.acags.2024.100216
https://www.sciencedirect.com/science/article/pii/S2590197424000636
Abstract
Labeled datasets in geoscience are often small: data acquisition is both costly and challenging, and data scarcity makes interpretation and downstream use in machine learning difficult. Deep learning algorithms require large datasets to learn a robust relationship between the data and their labels and to avoid overfitting. To overcome the paucity of data, transfer learning has been employed in classification tasks. But an alternative exists: there is often a large corpus of unlabeled data which may enhance the learning process. To evaluate this potential for subsurface data, we compare a high-performance semi-supervised learning (SSL) algorithm (SimCLRv2) with supervised transfer learning of a Convolutional Neural Network (CNN) in geological image classification.
We tested the two approaches on the task of classifying sediment disturbance in cores from International Ocean Drilling Program (IODP) Expeditions 383 and 385. Our results show that semi-supervised transfer learning can be an effective strategy to adopt, with SimCLRv2 capable of producing representations comparable to those of supervised transfer learning. However, attempts to enhance the performance of semi-supervised transfer learning with task-specific unlabeled images during self-supervision degraded the representations. Significantly, we demonstrate that SimCLRv2 trained on a dataset of core disturbance images can outperform supervised transfer learning of a CNN once a critical number of task-specific unlabeled images are available for self-supervision. The gain in performance compared to supervised transfer learning is 1% and 3% for binary and multi-class classification, respectively.
Supervised transfer learning can be deployed with comparative ease, whereas current SSL algorithms such as SimCLRv2 require more effort. We recommend that SSL be explored in cases where large amounts of unlabeled task-specific images exist and an improvement of a few percent in metrics matters. When examining small, highly specialized datasets without large amounts of unlabeled images, supervised transfer learning might be the best strategy to adopt. Overall, SSL is a promising approach, and future work should explore it using datasets of differing types, quantities, and quality.
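The abstract gives no implementation details, so the following is only a rough illustration of what self-supervision on unlabeled images optimizes in SimCLR-style methods. It sketches the normalized temperature-scaled cross-entropy (NT-Xent) contrastive loss in plain Python, assuming each image has already been augmented twice and encoded into two embedding vectors. The function names, the toy embeddings, and the temperature of 0.5 are illustrative assumptions and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """Mean NT-Xent loss over a batch of paired embeddings.

    view_a[i] and view_b[i] are assumed to be embeddings of two
    augmentations of the same unlabeled image i.
    """
    z = list(view_a) + list(view_b)
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        # Index of the other augmented view of the same image.
        pos = (i + n) % (2 * n)
        # Scaled similarities to every embedding except the anchor itself.
        logits = [cosine(z[i], z[k]) / temperature
                  for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the positive view.
        total += log_denominator - pos_logit
    return total / (2 * n)


# Toy embeddings: matching views point in nearly the same direction.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[1.0, 0.1], [0.1, 1.0]]
aligned = nt_xent_loss(view_a, view_b)
# Pairing each image with the wrong view should give a higher loss.
shuffled = nt_xent_loss(view_a, list(reversed(view_b)))
```

In this sketch the loss falls as the two views of the same image become more similar to each other than to views of other images, which is why a larger pool of task-specific unlabeled images gives the encoder more negatives and positives to learn from. A supervised classifier head would then be fine-tuned on the small labeled set.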