Safer and Smarter: Leveraging Interpretation-Guided Modeling and Data Merging of Disease and Environmental Data for Plant Disease Risk Prediction
Kaique S Alves, Denis A Shah, Helene R Dillard, Emerson M Del Ponte, Sarah J Pethybridge
Phytopathology, article PHYTO01250008FI. Published 2025-09-26. DOI: 10.1094/PHYTO-01-25-0008-FI (https://doi.org/10.1094/PHYTO-01-25-0008-FI)
Abstract
Plant disease epidemiologists often work with datasets smaller than ideal for data-hungry machine-learning (ML) algorithms, thereby risking overfitting. We demonstrate how an interpretation-guided modeling approach, leveraging complex ML primarily for insight generation, can overcome this challenge, using white mold (caused by Sclerotinia sclerotiorum) in snap beans (Phaseolus vulgaris) as a case study. An observational dataset of white mold prevalence across 356 commercial snap bean fields in central and western New York State (2006 to 2008) was augmented by merging georeferenced observations with POLARIS soils data and engineered features from downscaled ERA5-Land environmental data. Functional data analysis identified weather periods associated with white mold risk, and random forests (RFs), used interpretatively, identified key predictors. Although RF models showed high apparent performance, they exhibited significant overfitting and poor calibration. Insights from RF interpretation (via SHapley Additive exPlanations analysis) guided the development of a simpler, four-predictor logistic regression model using restricted cubic splines. This simpler model was better calibrated and had acceptable discrimination (internally validated C statistic = 0.77). For smaller epidemiological datasets, our results advocate for using ML primarily as an interpretive tool to guide the development of simpler, less data-intensive, yet robust predictive models better suited for practical disease management decisions.
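The two technical ingredients named in the abstract, restricted cubic splines and the C statistic, can be illustrated with a minimal sketch. It is not the authors' code: the abstract does not state the knot locations, the four predictors, or the software used. The function names `rcs_basis` and `c_statistic`, the knot values, and the toy scores are hypothetical. The spline basis uses one common parameterization (the truncated-power form that is linear beyond the outer knots), and the C statistic is computed as the fraction of concordant diseased/healthy pairs.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for one value x (truncated-power form).

    Returns [x, s_1(x), ..., s_{k-2}(x)] for k knots. The nonlinear terms
    are zero below the first knot and linear above the last knot.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def pos3(u):
        # Truncated cubic: u^3 when u > 0, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    scale = (t[-1] - t[0]) ** 2   # common normalization to keep terms on x's scale
    d = t[-1] - t[-2]             # spacing between the last two knots
    terms = [x]
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[-2]) * (t[-1] - t[j]) / d
                + pos3(x - t[-1]) * (t[-2] - t[j]) / d)
        terms.append(term / scale)
    return terms


def c_statistic(scores, labels):
    """Concordance (C) statistic for binary outcomes; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    concordant = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return concordant / (len(pos) * len(neg))


# Hypothetical knots for an illustrative predictor (not from the paper).
example_knots = [0.0, 1.0, 2.0, 3.0]
example_row = rcs_basis(1.5, example_knots)   # one linear + two nonlinear terms

# Toy predicted risks and observed outcomes (invented for illustration).
toy_c = c_statistic([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])   # 3 of 4 pairs concordant
```

In a model of the kind the abstract describes, each continuous predictor would be expanded with such a basis and the resulting columns passed to a logistic regression. The reported C statistic of 0.77 is internally validated, which the apparent (in-sample) value computed by `c_statistic` alone does not provide; a resampling-based optimism correction would be needed on top of this sketch.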
About the journal:
Phytopathology publishes articles on fundamental research that advances understanding of the nature of plant diseases, the agents that cause them, their spread, the losses they cause, and measures that can be used to control them. Phytopathology considers manuscripts covering all aspects of plant diseases, including bacteriology, host-parasite biochemistry and cell biology, biological control, disease control and pest management, description of new pathogen species, ecology and population biology, epidemiology, disease etiology, host genetics and resistance, mycology, nematology, plant stress and abiotic disorders, postharvest pathology and mycotoxins, and virology. Papers dealing mainly with taxonomy, such as descriptions of new plant pathogen taxa, are acceptable if they include plant disease research results such as pathogenicity, host range, etc. Taxonomic papers that focus on classification, identification, and nomenclature below the subspecies level may also be submitted to Phytopathology.