Validating spatio-temporal environmental machine learning models: Simpson’s paradox and data splits
Anna Boser
Environmental Research Communications (journal article), published 2024-03-08
DOI: 10.1088/2515-7620/ad2e44 (https://doi.org/10.1088/2515-7620/ad2e44)
Citations: 0
Abstract
Machine learning has revolutionized environmental sciences by estimating scarce environmental data, such as air quality, land cover type, wildlife population counts, and disease risk. However, current methods for validating these models often ignore the spatial or temporal structure commonly found in environmental data, leading to inaccurate evaluations of model quality. This paper outlines the problems that can arise from such validation methods and describes how to avoid erroneous assumptions about training data structure. In an example on air quality estimation, we show that a poor model with an r² of 0.09 can falsely appear to achieve an r² of 0.73 when the validation fails to account for Simpson’s paradox. The same model’s r² can inflate further, to 0.82, when the data are split improperly. To ensure high-quality synthetic data for research in environmental science, justice, and health, researchers must use validation procedures that reflect the structure of their training data.
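The gap between a pooled score and a structure-aware score can be illustrated with a small synthetic sketch. It is not the paper's data or method: the 20 stations, the noise levels, the `pearson_r2` helper, and the choice of squared Pearson correlation as the r² definition are all assumptions made for illustration. The hypothetical model knows each station's long-term mean but has no skill at tracking change over time, so pooling all stations rewards it for between-station differences alone.

```python
import random
from statistics import mean


def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: 20 monitoring stations, 200 time steps each.
observed, predicted = {}, {}
for station in range(20):
    baseline = rng.uniform(5.0, 50.0)  # station's long-term mean level
    # Observations fluctuate over time around the station baseline.
    observed[station] = [baseline + rng.gauss(0.0, 5.0) for _ in range(200)]
    # The stand-in model reproduces the baseline only; its noise is
    # independent of the observed fluctuations (no temporal skill).
    predicted[station] = [baseline + rng.gauss(0.0, 5.0) for _ in range(200)]

# Pooled evaluation: ignore station membership entirely.
pooled_obs = [v for s in observed for v in observed[s]]
pooled_pred = [v for s in predicted for v in predicted[s]]
pooled_r2 = pearson_r2(pooled_obs, pooled_pred)

# Structure-aware evaluation: score each station separately, then average.
within_r2 = mean(pearson_r2(observed[s], predicted[s]) for s in observed)

print(f"pooled r2: {pooled_r2:.2f}")
print(f"mean within-station r2: {within_r2:.2f}")
```

In this sketch the pooled score is high while the within-station score is close to zero, which is the kind of disagreement the abstract attributes to Simpson’s paradox. Which of the two numbers matters depends on whether the model will be used to compare locations or to track change over time at a given location.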