Safer and Smarter: Leveraging Interpretation-Guided Modeling and Data Merging of Disease and Environmental Data for Plant Disease Risk Prediction.

IF 3.1; Zone 2 (Agricultural and Forestry Sciences); JCR Q2 (Plant Sciences)

Phytopathology. Pub Date: 2025-09-26. DOI: 10.1094/PHYTO-01-25-0008-FI

Kaique S Alves, Denis A Shah, Helene R Dillard, Emerson M Del Ponte, Sarah J Pethybridge


Citations: 0

Abstract

Plant disease epidemiologists often work with datasets smaller than ideal for data-hungry machine-learning (ML) algorithms, thereby risking overfitting. We demonstrate how an interpretation-guided modeling approach, leveraging complex ML primarily for insight generation, can overcome this challenge, using white mold (caused by Sclerotinia sclerotiorum) in snap beans (Phaseolus vulgaris) as a case study. An observational dataset of white mold prevalence across 356 commercial snap bean fields in central and western New York State (2006 to 2008) was augmented by merging georeferenced observations with POLARIS soils data and engineered features from downscaled ERA5-Land environmental data. Functional data analysis identified weather periods associated with white mold risk, and random forests (RFs), used interpretatively, identified key predictors. Although RF models showed high apparent performance, they exhibited significant overfitting and poor calibration. Insights from RF interpretation (via SHapley Additive exPlanations analysis) guided the development of a simpler, four-predictor logistic regression model using restricted cubic splines. This simpler model was better calibrated and had acceptable discrimination (internally validated C statistic = 0.77). For smaller epidemiological datasets, our results advocate for using ML primarily as an interpretive tool to guide the development of simpler, less data-intensive, yet robust predictive models better suited for practical disease management decisions.
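Two ingredients named in the abstract, restricted cubic splines and the C statistic, can be sketched in a few lines. The sketch below is illustrative only and is not the authors' code. It assumes Harrell's common parameterization of the spline basis and distinct knot values. The function names `rcs_basis` and `c_statistic` and the knot values are hypothetical, since the abstract specifies neither the four predictors nor the knot locations.

```python
from itertools import product


def rcs_basis(x, knots):
    """Restricted cubic spline basis for one value x (Harrell's parameterization).

    Returns [x, s_1, ..., s_{k-2}] for k distinct knots. Any linear
    combination of these terms is linear below the first knot and above
    the last knot, which is what "restricted" refers to.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("need at least 3 knots")

    def cube(u):
        # Truncated cubic (u)_+^3
        return max(u, 0.0) ** 3

    gap = t[-1] - t[-2]
    scale = (t[-1] - t[0]) ** 2  # keeps spline terms on the scale of x
    terms = [float(x)]
    for j in range(k - 2):
        s = (cube(x - t[j])
             - cube(x - t[-2]) * (t[-1] - t[j]) / gap
             + cube(x - t[-1]) * (t[-2] - t[j]) / gap)
        terms.append(s / scale)
    return terms


def c_statistic(y_true, y_prob):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("need both outcome classes")
    score = sum(1.0 if p > n else 0.5 if p == n else 0.0
                for p, n in product(pos, neg))
    return score / (len(pos) * len(neg))
```

In a logistic regression, each continuous predictor would be expanded with `rcs_basis` before fitting. `c_statistic` computed on the training data gives only the apparent discrimination. An internally validated value, such as the 0.77 reported above, would additionally require a resampling scheme (for example, bootstrap optimism correction), which this sketch does not implement.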


Source journal: Phytopathology (Biology, Plant Sciences)

CiteScore: 5.90

Self-citation rate: 9.40%

Articles published: 505

Review time: 4-8 weeks

About the journal: Phytopathology publishes articles on fundamental research that advances understanding of the nature of plant diseases, the agents that cause them, their spread, the losses they cause, and measures that can be used to control them. Phytopathology considers manuscripts covering all aspects of plant diseases, including bacteriology, host-parasite biochemistry and cell biology, biological control, disease control and pest management, description of new pathogen species, ecology and population biology, epidemiology, disease etiology, host genetics and resistance, mycology, nematology, plant stress and abiotic disorders, postharvest pathology and mycotoxins, and virology. Papers dealing mainly with taxonomy, such as descriptions of new plant pathogen taxa, are acceptable if they include plant disease research results such as pathogenicity, host range, etc. Taxonomic papers that focus on classification, identification, and nomenclature below the subspecies level may also be submitted to Phytopathology.