Evaluating the role of training data origin for country-scale cropland mapping in data-scarce regions: A case study of Nigeria

ISPRS Open Journal of Photogrammetry and Remote Sensing | Pub Date: 2025-07-09 | DOI: 10.1016/j.ophoto.2025.100091

Joaquin Gajardo, Michele Volpi, Daniel Onwude, Thijs Defraeye

Volume 17, Article 100091. Source: https://www.sciencedirect.com/science/article/pii/S2667393225000109


Abstract

Cropland maps are essential for remote sensing-based agricultural monitoring, providing timely insights into agricultural development without requiring extensive field surveys. While machine learning enables large-scale mapping, it relies on geo-referenced ground-truth data, which is time-consuming to collect, motivating efforts to integrate global datasets for mapping in data-scarce regions. A key challenge is understanding how the quantity, quality, and proximity of the training data to the target region influence model performance in regions with limited local ground truth. To address this, we evaluate the impact of combining global and local datasets for cropland mapping in Nigeria at 10 m resolution. We manually labelled 1,827 data points evenly distributed across Nigeria and leveraged the crowd-sourced Geowiki dataset, evaluating three subsets of it: Nigeria, Nigeria + neighbouring countries, and worldwide. Using Google Earth Engine (GEE), we extracted multi-source time series data from Sentinel-1, Sentinel-2, ERA5 climate, and a digital elevation model (DEM), and compared Random Forest (RF) classifiers with Long Short-Term Memory (LSTM) networks, including a lightweight multi-task learning variant (multi-headed LSTM) previously applied to cropland mapping in other regions. Our findings highlight the importance of local training data, which consistently improved performance, with accuracy gains of up to 0.246 (RF) and 0.178 (LSTM). Models trained on Nigeria-only or regional datasets outperformed those trained on global data, except for the multi-headed LSTM, which uniquely benefited from global samples when local data was unavailable. A sensitivity analysis revealed that Sentinel-1, climate, and topographic data were particularly important, as their removal reduced accuracy by up to 0.154 and F1-score by up to 0.593. Handling class imbalance was also critical, with weighted loss functions improving accuracy by up to 0.071 for the single-headed LSTM. Our best-performing model, a single-headed LSTM trained on Nigeria-only data, achieved an F1-score of 0.814 and an accuracy of 0.842, performing competitively with the best global land cover product and showing strong recall, a metric highly relevant for food security applications. These results underscore the value of regionally focused training data, proper class imbalance handling, and multi-modal feature integration for improving cropland mapping in data-scarce regions. We release our data, source code, output maps, and an interactive GEE web application to facilitate further research.
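As an illustration of the class-imbalance handling and the metrics reported above, the following is a minimal sketch and not the authors' implementation. The abstract does not say which weighting scheme was used; inverse-frequency class weights are assumed here, and the function names (`inverse_frequency_weights`, `weighted_bce`, `accuracy_recall_f1`) are hypothetical.

```python
import math


def inverse_frequency_weights(labels):
    """Assumed scheme: weight_c = n_samples / (n_classes * count_c)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * m) for c, m in counts.items()}


def weighted_bce(probs, labels, weights, eps=1e-7):
    """Weighted mean binary cross-entropy; probs are predicted P(cropland)."""
    total, norm = 0.0, 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        w = weights[y]
        total += -w * (math.log(p) if y == 1 else math.log(1.0 - p))
        norm += w
    return total / norm


def accuracy_recall_f1(preds, labels):
    """Accuracy, recall and F1 for the positive (cropland = 1) class."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    accuracy = (tp + tn) / len(labels)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, f1


# Toy data (not from the paper): one cropland point among four.
labels = [1, 0, 0, 0]
weights = inverse_frequency_weights(labels)
loss = weighted_bce([0.5, 0.5, 0.5, 0.5], labels, weights)
metrics = accuracy_recall_f1([1, 1, 0, 0], labels)
```

With this weighting, errors on the rarer class contribute more to the loss, which tends to trade some precision for recall. That matches the abstract's emphasis on recall for food security applications, although the exact effect depends on the model and data.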

