Evaluating the role of training data origin for country-scale cropland mapping in data-scarce regions: A case study of Nigeria
Joaquin Gajardo, Michele Volpi, Daniel Onwude, Thijs Defraeye
ISPRS Open Journal of Photogrammetry and Remote Sensing, Volume 17, Article 100091. Published 2025-07-09.
DOI: 10.1016/j.ophoto.2025.100091
https://www.sciencedirect.com/science/article/pii/S2667393225000109
Citations: 0
Abstract
Cropland maps are essential for remote sensing-based agricultural monitoring, providing timely insights about agricultural development without requiring extensive field surveys. While machine learning enables large-scale mapping, it relies on geo-referenced ground-truth data, which is time-consuming to collect, motivating efforts to integrate global datasets for mapping in data-scarce regions. A key challenge is understanding how the quantity, quality, and proximity of the training data to the target region influence model performance in regions with limited local ground truth. To address this, we evaluate the impact of combining global and local datasets for cropland mapping in Nigeria at 10 m resolution. We manually labelled 1,827 data points evenly distributed across Nigeria and leveraged the crowd-sourced Geowiki dataset, evaluating three subsets of it: Nigeria, Nigeria + neighbouring countries, and worldwide. Using Google Earth Engine (GEE), we extracted multi-source time series data from Sentinel-1, Sentinel-2, ERA5 climate, and a digital elevation model (DEM) and compared Random Forest (RF) classifiers with Long Short-Term Memory (LSTM) networks, including a lightweight multi-task learning variant (multi-headed LSTM), previously applied to cropland mapping in other regions. Our findings highlight the importance of local training data, which consistently improved performance, with accuracy gains up to 0.246 (RF) and 0.178 (LSTM). Models trained on Nigeria-only or regional datasets outperformed those trained on global data, except for the multi-headed LSTM, which uniquely benefited from global samples when local data was unavailable. A sensitivity analysis revealed that Sentinel-1, climate, and topographic data were particularly important, as their removal reduced accuracy by up to 0.154 and F1-score by 0.593. Handling class imbalance was also critical, with weighted loss functions improving accuracy by up to 0.071 for the single-headed LSTM. 
Our best-performing model, a single-headed LSTM trained on Nigeria-only data, achieved an F1-score of 0.814 and accuracy of 0.842, performing competitively with the best global land cover product and showing strong recall performance, a metric highly relevant for food security applications. These results underscore the value of regionally focused training data, proper class imbalance handling, and multi-modal feature integration for improving cropland mapping in data-scarce regions. We release our data, source code, output maps, and an interactive GEE web application to facilitate further research.
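
The abstract credits weighted loss functions with part of the accuracy gain but does not state the weighting scheme. As a rough illustration only, the sketch below assumes inverse-frequency class weights applied to a binary cross-entropy loss for a crop / non-crop classifier. The function names `class_weights` and `weighted_bce` are hypothetical and are not taken from the authors' released code.

```python
import math


def class_weights(labels):
    """Inverse-frequency weights for binary labels (0 = non-crop, 1 = crop).

    Assumed scheme: weight_c = n / (2 * n_c), so a perfectly balanced
    dataset gives a weight of 1.0 to both classes.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return {0: n / (2 * n_neg), 1: n / (2 * n_pos)}


def weighted_bce(labels, probs, weights, eps=1e-7):
    """Weighted mean of per-sample binary cross-entropy.

    Each sample's loss is scaled by the weight of its true class, and the
    sum is divided by the total weight rather than the sample count.
    """
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
        weight_sum += weights[y]
    return total / weight_sum


# Toy example: one crop sample among three non-crop samples.
labels = [1, 0, 0, 0]
weights = class_weights(labels)

# A model that predicts "non-crop" everywhere misses the only crop sample.
probs = [0.1, 0.1, 0.1, 0.1]
loss_weighted = weighted_bce(labels, probs, weights)
loss_unweighted = weighted_bce(labels, probs, {0: 1.0, 1: 1.0})
```

In this toy case the weighted loss is larger than the unweighted one, because the single missed crop sample carries three times the weight of each non-crop sample. That is the general mechanism by which class weighting pushes a model towards higher recall on the minority class.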