Validating spatio-temporal environmental machine learning models: Simpson’s paradox and data splits

IF 2.9 | Zone 4, Environmental Science and Ecology | Q3 ENVIRONMENTAL SCIENCES

Environmental Research Communications Pub Date : 2024-03-08 DOI:10.1088/2515-7620/ad2e44

Anna Boser


Citations: 0

Abstract

Machine learning has revolutionized environmental sciences by estimating scarce environmental data, such as air quality, land cover type, wildlife population counts, and disease risk. However, current methods for validating these models often ignore the spatial or temporal structure commonly found in environmental data, leading to inaccurate evaluations of model quality. This paper outlines the problems that can arise from such validation methods and describes how to avoid erroneous assumptions about training data structure. In an example on air quality estimation, we show that a poor model with an r² of 0.09 can falsely appear to achieve an r² value of 0.73 by failing to account for Simpson’s paradox. This same model’s r² can further inflate to 0.82 when improperly splitting data. To ensure high-quality synthetic data for research in environmental science, justice, and health, researchers must use validation procedures that reflect the structure of their training data.
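The two pitfalls named in the abstract can be illustrated with a small simulation. The sketch below is not the paper's method or data: it assumes r² means the squared Pearson correlation, invents a set of monitoring sites with different mean levels, and uses a hypothetical "model" that recovers each site's mean but none of its variation over time. All function names and parameter values are illustrative. Pooling every site into one score rewards the model for between-site differences alone, while scoring each site separately shows that it explains almost nothing within a site. The last helper splits by site, so that no site contributes to both the training and the test set.

```python
import random


def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def simulate(n_sites=20, n_times=50, seed=0):
    """Return {site_id: (observations, predictions)} for synthetic sites.

    Assumption for illustration: predictions share the site mean with the
    observations but are otherwise independent noise, so the model knows
    where a site sits on average and nothing about its temporal variation.
    """
    rng = random.Random(seed)
    data = {}
    for site in range(n_sites):
        site_mean = rng.gauss(10.0, 5.0)  # large between-site differences
        obs = [site_mean + rng.gauss(0.0, 2.0) for _ in range(n_times)]
        pred = [site_mean + rng.gauss(0.0, 2.0) for _ in range(n_times)]
        data[site] = (obs, pred)
    return data


def pooled_r2(data):
    """r2 computed once over all sites and times lumped together."""
    obs = [o for site_obs, _ in data.values() for o in site_obs]
    pred = [p for _, site_pred in data.values() for p in site_pred]
    return pearson_r2(obs, pred)


def mean_within_site_r2(data):
    """r2 computed separately per site, then averaged across sites."""
    scores = [pearson_r2(obs, pred) for obs, pred in data.values()]
    return sum(scores) / len(scores)


def split_by_site(site_ids, test_fraction=0.25, seed=0):
    """Hold out whole sites so train and test never share a site."""
    rng = random.Random(seed)
    shuffled = list(site_ids)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    data = simulate()
    print(f"pooled r2:           {pooled_r2(data):.2f}")
    print(f"mean within-site r2: {mean_within_site_r2(data):.2f}")
    train_sites, test_sites = split_by_site(list(data))
    print(f"train sites: {len(train_sites)}, test sites: {len(test_sites)}")
```

Which grouping is appropriate (by site, by time period, or both) depends on how the model will be used; the site-level split here is only one possible choice.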




Source journal: Environmental Research Communications (ENVIRONMENTAL SCIENCES)

CiteScore: 3.50

Self-citation rate: 0.00%

Publication count: 136