Evaluating Random Forest Model Performance for Cave and Sinkhole Prediction in the Cradle of Humankind, South Africa: Preliminary Analysis and Variable Importance Assessments
Margaret J. Furtner, Robert L. Anemone, Lei Wang, Juliet K. Brophy
Journal of Archaeological Method and Theory. Publication date: 2026-03-04. DOI: 10.1007/s10816-025-09761-1 (https://doi.org/10.1007/s10816-025-09761-1)
Abstract
Surveying an area for new fossil sites is a labor-intensive and resource-draining activity that can be alleviated with the aid of machine learning models. In karst landscapes of southern Africa, Plio-Pleistocene fossils that inform the paleoanthropological record are primarily found preserved in caves and sinkholes. The purpose of this study is to assess the utility of Random Forest (RF) models for cave and sinkhole prediction in the Cradle of Humankind, South Africa. Multispectral satellite imagery, digital elevation models (DEMs), and geologic maps were converted into raster (pixelated matrix) images in a GIS environment to denote varying aspects of the local topography, including elevation, slope, aspect, curvature, drainage, spectral reflectance, vegetation cover, fault proximity, and underlying geology. The rasters were stacked and overlaid with 1080 known cave and sinkhole locality points and 1080 random non-cave points in the study area for model training. Variable values associated with these geopoints were input into an RF model in Python for training and evaluation using a spatial ten-fold cross-validation. The model performed with 81.6% accuracy and an area under the curve (AUC) of 0.912. The importance of each variable for prediction was evaluated by measuring the increase in prediction error when variable values were shuffled. Distance to major faults, location within the Chuniespoort geologic group, dolomite presence, chert presence, and elevation exhibited the highest importance for model accuracy, while three out of 48 total predictor variables exhibited less importance than a randomly generated variable. The identification of important/unimportant variables will help build more efficient, robust models in future iterations, as well as help identify variables that could be useful in other karst regions.
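The evaluation workflow described in the abstract (an RF classifier, spatial ten-fold cross-validation, and shuffle-based variable importance compared against a randomly generated variable) can be illustrated with a minimal sketch. This is not the authors' code: the abstract states only that the model was built in Python, so the use of scikit-learn, the strip-based spatial blocking, the hyperparameters, and every variable name below are assumptions, and the data are synthetic stand-ins rather than the study's raster values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic stand-in for values sampled from stacked rasters (NOT the study's data).
n_points = 400
x_coord = rng.uniform(0, 10, size=n_points)       # hypothetical easting of each point
fault_dist = rng.uniform(0, 5, size=n_points)     # hypothetical predictor
elevation = rng.normal(1500, 50, size=n_points)   # hypothetical predictor
random_var = rng.normal(size=n_points)            # randomly generated baseline variable

# Toy labels: "cave" points (1) are more likely close to a fault.
y = (fault_dist + rng.normal(0, 1, size=n_points) < 2.5).astype(int)
X = np.column_stack([fault_dist, elevation, random_var])
names = ["fault_dist", "elevation", "random_var"]

# Assumed spatial blocking: ten vertical strips by easting, so that test points
# are spatially separated from training points instead of randomly interleaved.
blocks = x_coord.astype(int)  # values 0..9

accs, aucs, importances = [], [], []
for train_idx, test_idx in GroupKFold(n_splits=10).split(X, y, groups=blocks):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])

    # Accuracy and AUC on the held-out spatial block.
    accs.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    aucs.append(roc_auc_score(y[test_idx], model.predict_proba(X[test_idx])[:, 1]))

    # Permutation importance: mean drop in accuracy when one column is shuffled.
    result = permutation_importance(
        model, X[test_idx], y[test_idx], n_repeats=5, random_state=0
    )
    importances.append(result.importances_mean)

mean_acc = float(np.mean(accs))
mean_auc = float(np.mean(aucs))
mean_importance = dict(zip(names, np.mean(importances, axis=0)))

# Predictors that score below the random baseline are candidates for removal.
below_baseline = [
    n for n in names if mean_importance[n] < mean_importance["random_var"]
]

print(f"accuracy={mean_acc:.3f}  AUC={mean_auc:.3f}")
print(mean_importance)
print("below random baseline:", below_baseline)
```

Grouping the folds by spatial block is one common way to reduce the optimistic bias that spatial autocorrelation introduces into ordinary random folds; the study may have defined its spatial folds differently.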
About the journal:
The Journal of Archaeological Method and Theory, the leading journal in its field, presents original articles that address method- or theory-focused issues of current archaeological interest and represent significant explorations on the cutting edge of the discipline. The journal also welcomes topical syntheses that critically assess and integrate research on a specific subject in archaeological method or theory, as well as examinations of the history of archaeology. Written by experts, the articles benefit an international audience of archaeologists, students of archaeology, and practitioners of closely related disciplines. Specific topics covered in recent issues include the use of niche construction theory in archaeology, new developments in the use of soil chemistry in archaeological interpretation, and a model for the prehistoric development of clothing. The Journal's distinguished Editorial Board includes archaeologists with worldwide archaeological knowledge (the Americas, Asia and the Pacific, Europe, and Africa) and expertise in a wide range of methodological and theoretical issues.
The journal is rated 'A' in the European Reference Index for the Humanities (ERIH), a new reference index that aims to help evenly assess the scientific quality of humanities research output. For more information, visit: http://www.esf.org/research-areas/humanities/activities/research-infrastructures.html
The journal is also rated 'A' in the Australian Research Council Humanities and Creative Arts Journal List. For more information, visit: http://www.arc.gov.au/era/journal_list_dev.htm