Validating spatio-temporal environmental machine learning models: Simpson’s paradox and data splits

IF 2.5, Zone 4 (Environmental Science and Ecology), JCR Q3 (Environmental Sciences)
Anna Boser
Published: 2024-03-08
DOI: 10.1088/2515-7620/ad2e44 (https://doi.org/10.1088/2515-7620/ad2e44)
Citations: 0

Abstract

Machine learning has revolutionized environmental sciences by estimating scarce environmental data, such as air quality, land cover type, wildlife population counts, and disease risk. However, current methods for validating these models often ignore the spatial or temporal structure commonly found in environmental data, leading to inaccurate evaluations of model quality. This paper outlines the problems that can arise from such validation methods and describes how to avoid erroneous assumptions about training data structure. In an example on air quality estimation, we show that a poor model with an r² of 0.09 can falsely appear to achieve an r² value of 0.73 by failing to account for Simpson’s paradox. This same model’s r² can further inflate to 0.82 when improperly splitting data. To ensure high-quality synthetic data for research in environmental science, justice, and health, researchers must use validation procedures that reflect the structure of their training data.
Source journal: Environmental Research Communications (Environmental Sciences)
CiteScore: 3.50
Self-citation rate: 0.00%
Articles published: 136