Validating spatio-temporal environmental machine learning models: Simpson’s paradox and data splits
Anna Boser
Environmental Research Communications, published 2024-03-08
DOI: 10.1088/2515-7620/ad2e44 (https://doi.org/10.1088/2515-7620/ad2e44)
Machine learning has revolutionized environmental sciences by estimating scarce environmental data, such as air quality, land cover type, wildlife population counts, and disease risk. However, current methods for validating these models often ignore the spatial or temporal structure commonly found in environmental data, leading to inaccurate evaluations of model quality. This paper outlines the problems that can arise from such validation methods and describes how to avoid erroneous assumptions about training data structure. In an example of air quality estimation, we show that a poor model with an r² of 0.09 can falsely appear to achieve an r² of 0.73 when the evaluation fails to account for Simpson’s paradox. This same model’s r² can inflate further to 0.82 when the data are split improperly. To ensure high-quality synthetic data for research in environmental science, justice, and health, researchers must use validation procedures that reflect the structure of their training data.
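The two pitfalls named in the abstract can be illustrated with a small synthetic sketch. This is not the paper's data or code: the station counts, pollutant ranges, noise levels, and every function name below are assumptions chosen for illustration, and r² is taken here to be the squared Pearson correlation. The hypothetical model reproduces each station's average level but has no skill at day-to-day variation, so a pooled score looks strong while the per-station scores are near zero. The last helper shows one way to hold out whole stations so that no location appears in both training and test sets.

```python
import random


def r_squared(y_true, y_pred):
    """Squared Pearson correlation between observations and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        return 0.0
    return cov * cov / (var_t * var_p)


def simulate(n_stations=20, n_days=200, seed=0):
    """Synthetic (assumed) data: station -> (observations, predictions)."""
    rng = random.Random(seed)
    data = {}
    for station in range(n_stations):
        baseline = rng.uniform(5.0, 50.0)  # station-level mean concentration
        # Observed values vary from day to day around the station baseline.
        obs = [baseline + rng.gauss(0.0, 5.0) for _ in range(n_days)]
        # Predictions get the baseline right, but their daily deviations are
        # independent noise, i.e. the model has no temporal skill.
        pred = [baseline + rng.gauss(0.0, 5.0) for _ in range(n_days)]
        data[station] = (obs, pred)
    return data


def pooled_r2(data):
    """r² over all station-days at once, ignoring the grouping."""
    all_obs = [v for obs, _ in data.values() for v in obs]
    all_pred = [v for _, pred in data.values() for v in pred]
    return r_squared(all_obs, all_pred)


def mean_within_station_r2(data):
    """Average of the r² values computed separately for each station."""
    scores = [r_squared(obs, pred) for obs, pred in data.values()]
    return sum(scores) / len(scores)


def split_by_station(data, test_fraction=0.25, seed=0):
    """Hold out whole stations so train and test share no location."""
    rng = random.Random(seed)
    stations = sorted(data)
    rng.shuffle(stations)
    n_test = max(1, round(len(stations) * test_fraction))
    test_ids = set(stations[:n_test])
    train = {s: data[s] for s in stations if s not in test_ids}
    test = {s: data[s] for s in test_ids}
    return train, test


if __name__ == "__main__":
    data = simulate()
    print("pooled r2:", round(pooled_r2(data), 3))
    print("mean within-station r2:", round(mean_within_station_r2(data), 3))
    train, test = split_by_station(data)
    print("train stations:", len(train), "test stations:", len(test))
```

Which score is the right one depends on the intended use: if the estimates will be used to study change over time at a location, the within-station figure is the relevant measure, and if they will be applied to unmonitored locations, the split should hold out locations as sketched here.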