Assessing the robustness and generalizability of machine learning models for predicting selenium content in rice: a case study from the Pearl River Delta and Eastern Guangdong, China.
Guiqi Ye, Tingting Li, Wenda Geng, Kun Qian, Xudong Ma, Qingye Hou, Tao Yu, Zhongfang Yang, Xin Zhu
Environmental Geochemistry and Health, 47(9), 382. Journal article, published 2025-08-13. DOI: 10.1007/s10653-025-02681-9 (https://doi.org/10.1007/s10653-025-02681-9). Impact factor: 3.8. JCR: Q3 (Engineering, Environmental).
Citations: 0
Abstract
Crop selenium (Se) uptake is influenced by complex factors, which has prompted extensive research on predicting the Se content of crop grains and led to the development of various prediction methods. However, the practical application of these models is limited by geographical constraints and variations in independent variables. This study selected two distinct regions in Guangdong Province, China: the Pearl River Delta (PRD), a Quaternary plain region, and Heyuan, a hilly region characterized by outcrops of clastic rocks. A total of 205 paired rice and rhizosphere soil samples (PRD: 2016) and 60 paired samples (Heyuan: 2023) were collected to assess model robustness and generalizability. In the PRD and Heyuan, respectively, 82.93% and 30.00% of soil samples had Se ≥ 0.40 mg/kg, and 72.68% and 38.33% of rice grain samples had Se ≥ 0.04 mg/kg. However, no significant positive correlation was observed between soil Se and rice grain Se content in either area. Further analysis found that the main factors influencing rice grain Se content were soil SiO2, Al2O3, total organic carbon (TOC), S, and pH. Applied separately to the datasets from the two sampling periods, the model yielded strong results, indicating that it is robust and does not fluctuate greatly with the time of sample collection. The five-feature subset was also used to predict the two regions separately, with significant results. This indicates that the subset of predictive model features is highly generalizable, and that differences in the lithology of the soil parent materials and in topography do not significantly affect the prediction results.
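The threshold percentages in the abstract can be related back to sample counts with a little arithmetic. The sketch below is illustrative only: the helper names `exceedance_rate` and `implied_count` are hypothetical, and the counts it derives are back-calculated from the reported sample sizes (205 and 60) and percentages. They are not figures stated in the paper.

```python
def exceedance_rate(values, threshold):
    """Return the percentage of samples whose value is >= threshold."""
    if not values:
        raise ValueError("values must not be empty")
    hits = sum(1 for v in values if v >= threshold)
    return 100.0 * hits / len(values)


def implied_count(percent, n_samples):
    """Return the nearest whole number of samples matching a reported percentage."""
    return round(percent * n_samples / 100.0)


# Sample sizes and percentages as reported in the abstract.
# The counts computed from them are an assumption (back-calculation),
# not values given by the authors.
REPORTED = {
    ("PRD", "soil Se >= 0.40 mg/kg"): (205, 82.93),
    ("Heyuan", "soil Se >= 0.40 mg/kg"): (60, 30.00),
    ("PRD", "grain Se >= 0.04 mg/kg"): (205, 72.68),
    ("Heyuan", "grain Se >= 0.04 mg/kg"): (60, 38.33),
}

if __name__ == "__main__":
    for (region, criterion), (n, pct) in REPORTED.items():
        k = implied_count(pct, n)
        # Recompute the percentage from the implied count as a consistency check.
        print(f"{region:7s} {criterion:24s} ~{k}/{n} samples ({100.0 * k / n:.2f}%)")
```

Given measured concentrations, `exceedance_rate(soil_se_values, 0.40)` would give the same kind of percentage directly. Here `soil_se_values` stands for a hypothetical list of soil Se measurements in mg/kg.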
About the journal:
Environmental Geochemistry and Health publishes original research papers and review papers across the broad field of environmental geochemistry. Environmental geochemistry and health establishes and explains links between the natural or disturbed chemical composition of the earth’s surface and the health of plants, animals and people.
Beneficial elements regulate or promote enzymatic and hormonal activity whereas other elements may be toxic. Bedrock geochemistry controls the composition of soil and hence that of water and vegetation. Environmental issues, such as pollution, arising from the extraction and use of mineral resources, are discussed. The effects of contaminants introduced into the earth’s geochemical systems are examined. Geochemical surveys of soil, water and plants show how major and trace elements are distributed geographically. Associated epidemiological studies reveal the possibility of causal links between the natural or disturbed geochemical environment and disease. Experimental research illuminates the nature or consequences of natural or disturbed geochemical processes.
The journal particularly welcomes novel research linking environmental geochemistry and health issues on such topics as: heavy metals (including mercury), persistent organic pollutants (POPs), and mixed chemicals emitted through human activities, such as uncontrolled recycling of electronic-waste; waste recycling; surface-atmospheric interaction processes (natural and anthropogenic emissions, vertical transport, deposition, and physical-chemical interaction) of gases and aerosols; phytoremediation/restoration of contaminated sites; food contamination and safety; environmental effects of medicines; effects and toxicity of mixed pollutants; speciation of heavy metals/metalloids; effects of mining; disturbed geochemistry from human behavior, natural or man-made hazards; particle and nanoparticle toxicology; risk and the vulnerability of populations, etc.