Reconstructing missing data by comparing interpolation techniques: Applications for long-term water quality data

IF 2.1 | Zone 3, Earth Sciences | Q2, Limnology

Limnology and Oceanography: Methods Pub Date : 2023-05-30 DOI:10.1002/lom3.10556

Danelle M. Larson, Wako Bungula, Amber Lee, Alaina Stockdill, Casey McKean, Frederick "Forrest" Miller, Killian Davis, Richard A. Erickson, Enrika Hlavacek

Limnology and Oceanography: Methods, volume 21, issue 7, pages 435-449. Publication type: journal article. Full text: https://onlinelibrary.wiley.com/doi/10.1002/lom3.10556

Citations: 0

Abstract

Missing data are typical yet must be addressed for proper inferences or expanding datasets to guide our limnological understanding and management of aquatic systems. Interpolation methods (i.e., estimating missing values using known values within the dataset) can alleviate data gaps and common problems. We compared seven popular interpolation methods for predicting substantial missingness in a long-term water quality dataset from the Upper Mississippi River, U.S.A. The dataset included 80,000 sampling sites collected over 30 yr that had substantial missingness for total nitrogen (TN), total phosphorus (TP), and water velocity. For all three interpolated water quality variables, random forests had very high prediction accuracy and outperformed the methods of ordinary kriging, polynomial regressions, regression trees, and inverse distance weighting. TP had a mean absolute error (MAE) of 0.03 mg (L-TP)⁻¹, TN had an MAE of 0.39 mg (L-TN)⁻¹, and water velocity had an MAE of 0.10 m s⁻¹. The random forests' error rates were mapped and showed low spatiotemporal variability across the riverscape, indicating high model performance across many habitat types and large spatial scales. In the current era of “big data,” interpolation becomes an imperative step prior to ecological analyses yet remains unfamiliar and underutilized. Our research briefly describes the importance of addressing missingness and provides a roadmap to conduct model intercomparisons of other big datasets. We also share adaptable data analysis scripts, which allow others to readily conduct interpolation comparisons for many limnology applications and contexts.
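The comparison described in the abstract (hold out known values, predict them with each method, and rank the methods by MAE) can be illustrated with a minimal sketch. This is not the authors' shared scripts: the function names, the synthetic data, and the choice of a global-mean baseline are all assumptions made for illustration, and only inverse distance weighting is taken from the methods named in the abstract.

```python
import math
import random


def idw_predict(known, target, power=2.0):
    """Inverse distance weighting: weighted mean of known values,
    with weights 1 / distance**power. `known` holds (x, y, value) tuples."""
    num = 0.0
    den = 0.0
    for x, y, value in known:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return value  # target coincides with a known site
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den


def mean_absolute_error(observed, predicted):
    """MAE: average of |observed - predicted| over all held-out values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def compare_methods(sites, holdout_fraction=0.2, seed=0):
    """Hide a random fraction of sites, predict them from the rest,
    and return {method name: MAE}. Hypothetical helper for illustration."""
    rng = random.Random(seed)
    shuffled = sites[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * holdout_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    observed = [v for _, _, v in test]
    train_mean = sum(v for _, _, v in train) / len(train)
    predictions = {
        "global_mean": [train_mean for _ in test],  # naive baseline
        "idw": [idw_predict(train, (x, y)) for x, y, _ in test],
    }
    return {name: mean_absolute_error(observed, preds)
            for name, preds in predictions.items()}


def make_synthetic_sites(n=200, seed=1):
    """Synthetic stand-in data (an assumption, not the river dataset):
    a smooth spatial gradient plus small Gaussian noise."""
    rng = random.Random(seed)
    sites = []
    for _ in range(n):
        x, y = rng.uniform(0, 10), rng.uniform(0, 10)
        value = 0.1 + 0.02 * x + 0.01 * y + rng.gauss(0, 0.005)
        sites.append((x, y, value))
    return sites


if __name__ == "__main__":
    scores = compare_methods(make_synthetic_sites())
    for name, mae in sorted(scores.items(), key=lambda kv: kv[1]):
        print(f"{name}: MAE = {mae:.4f}")
```

Further methods (for example a random forest from a machine-learning library) could be added as extra entries in the `predictions` dictionary, so that every method is scored on the same held-out sites.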




Source journal

Limnology and Oceanography: Methods (Earth Sciences: Oceanography)

CiteScore: 4.80

Self-citation rate: 3.70%

Review time: 3 months

About the journal: Limnology and Oceanography: Methods (ISSN 1541-5856) is a companion to ASLO's top-rated journal Limnology and Oceanography, and articles are held to the same high standards. In order to provide the most rapid publication consistent with high standards, Limnology and Oceanography: Methods appears in electronic format only, and the entire submission and review system is online. Articles are posted as soon as they are accepted and formatted for publication. Limnology and Oceanography: Methods will consider manuscripts whose primary focus is methodological, and that deal with problems in the aquatic sciences. Manuscripts may present new measurement equipment, techniques for analyzing observations or samples, methods for understanding and interpreting information, analyses of metadata to examine the effectiveness of approaches, invited and contributed reviews and syntheses, and techniques for communicating and teaching in the aquatic sciences.