Validating predictions of burial mounds with field data: the promise and reality of machine learning

Adela Sobotkova, R. Kristensen-Mclachlan, Orla Mallon, Shawn A. Ross

Journal of Documentation, published 2024-04-26. DOI: https://doi.org/10.1108/jd-05-2022-0096. Citations: 0
Abstract
Purpose
This paper provides practical advice for archaeologists and heritage specialists wishing to use machine learning (ML) approaches to identify archaeological features in high-resolution satellite imagery (or other remotely sensed data sources). We seek to balance the disproportionately optimistic literature on applying ML to archaeological prospection by discussing its limitations, challenges and other difficulties. We further seek to raise awareness among researchers of the time, effort, expertise and resources necessary to implement ML successfully, so that they can make an informed choice between ML and manual inspection approaches.

Design/methodology/approach
Automated object detection has been the holy grail of archaeological remote sensing for the last two decades. ML models have proven able to detect uniform features against a consistent background, but more variegated imagery remains a challenge. We set out to detect burial mounds in satellite imagery of a diverse landscape in Central Bulgaria using a pre-trained convolutional neural network (CNN) plus additional, low-touch training to improve performance. Training used MOUND/NOT MOUND cutouts, and the model then assessed arbitrary tiles of the same size from the image. Results were validated against field data.

Findings
Validation against field data showed that self-reported success rates were misleadingly high and that the model was misidentifying most features. With an identification threshold of 60% probability, and with the CNN assessing tiles of a fixed size, tile-based false negative rates were 95–96%, false positive rates were 87–95% of tagged tiles, and true positives were only 5–13%. Counterintuitively, the model provided with training data selected for highly visible mounds (rather than all mounds) performed worse.
Development of the model, meanwhile, required approximately 135 person-hours of work.

Research limitations/implications
Our attempt to deploy a pre-trained CNN demonstrates the limitations of this approach when it is used to detect varied features of different sizes within a heterogeneous landscape containing confounding natural and modern features, such as roads, forests and field boundaries. The model detected incidental features rather than the mounds themselves, making external validation with field data an essential part of CNN workflows. Correcting the model would require refining the training data as well as adopting different approaches to model choice and execution, raising the computational requirements beyond the level of most cultural heritage practitioners.

Practical implications
Improving the pre-trained model's performance would require considerable time and resources, on top of the time already invested. The degree of manual intervention required – particularly around the subsetting and annotation of training data – is so significant that it raises the question of whether it would be more efficient to identify all of the mounds manually, either through brute-force inspection by experts or by crowdsourcing the analysis to trained – or even untrained – volunteers. Researchers and heritage specialists seeking efficient methods for extracting features from remotely sensed data should weigh the costs and benefits of ML versus manual approaches carefully.

Social implications
Our literature review indicates that the use of artificial intelligence (AI) and ML approaches in archaeological prospection has grown exponentially in the past decade, approaching adoption levels associated with "crossing the chasm" from innovators and early adopters to the majority of researchers. The literature itself, however, is overwhelmingly positive, reflecting some combination of publication bias and a rhetoric of unconditional success. This paper presents the failure of a good-faith attempt to utilise these approaches as a counterbalance and cautionary tale for potential adopters of the technology. Early-majority adopters may find ML difficult to implement effectively in real-life scenarios.

Originality/value
Unlike many high-profile reports from well-funded projects, our paper represents a serious but modestly resourced attempt to apply an ML approach to archaeological remote sensing, using techniques like transfer learning that are promoted as solutions to the time and cost problems associated with, e.g., annotating and manipulating training data. While the majority of articles uncritically promote ML, or discuss only how challenges were overcome, our paper investigates how – despite reasonable self-reported scores – the model failed to locate the target features when compared to field data. We also present time, expertise and resourcing requirements, a rarity in ML-for-archaeology publications.
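The tile-based workflow described under Design/methodology/approach – cutting a large scene into fixed-size cutouts for the CNN to classify one by one – can be sketched as follows. This is a minimal illustration only: the tile size, image dimensions and the convention of discarding partial edge tiles are assumptions for the example, not parameters reported by the paper.

```python
import numpy as np

def tile_image(image: np.ndarray, tile_size: int):
    """Split a 2-D image array into non-overlapping fixed-size tiles.

    Returns (row_offset, col_offset, tile) triples. Edge remainders that
    do not fill a complete tile are discarded, one common convention.
    """
    tiles = []
    h, w = image.shape[:2]
    for r in range(0, h - tile_size + 1, tile_size):
        for c in range(0, w - tile_size + 1, tile_size):
            tiles.append((r, c, image[r:r + tile_size, c:c + tile_size]))
    return tiles

# Example: a 512x512 scene cut into 128x128 cutouts yields 16 tiles,
# each of which would be scored MOUND/NOT MOUND by the classifier.
scene = np.zeros((512, 512), dtype=np.uint8)
cutouts = tile_image(scene, 128)
print(len(cutouts))  # 16
```

Because every tile is scored independently, a mound straddling a tile boundary can be split across cutouts, one reason fixed-size tiling struggles with features of varying size.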
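The tile-based rates reported under Findings can be reproduced from per-tile model probabilities and field-verified labels. The 60% threshold comes from the paper; the function name and the toy data below are illustrative assumptions. Note that the paper reports false positives as a share of tagged (predicted-mound) tiles and false negatives as a share of field-verified mounds, which the sketch mirrors.

```python
def tile_confusion(probs, field_labels, threshold=0.60):
    """Tag a tile as MOUND when its predicted probability meets the
    threshold, then compare against field-verified labels.

    Returns raw counts plus the two rates used in the paper:
    false positives as a share of tagged tiles, and false negatives
    as a share of field-verified mound tiles.
    """
    tp = fp = tn = fn = 0
    for p, actual in zip(probs, field_labels):
        predicted = p >= threshold
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    tagged = tp + fp          # tiles the model flagged as mounds
    mounds = tp + fn          # tiles field data confirms as mounds
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "fp_rate_of_tagged": fp / tagged if tagged else 0.0,
        "fn_rate_of_mounds": fn / mounds if mounds else 0.0,
    }

# Toy data: 4 tiles tagged, 3 of them wrongly; 1 real mound missed.
probs = [0.9, 0.7, 0.65, 0.8, 0.2, 0.4]
labels = [True, False, False, False, True, False]
print(tile_confusion(probs, labels))
```

Computing both rates against independent field data, rather than against a held-out slice of the same training imagery, is what exposed the gap between self-reported and actual performance.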