Assessing the Impact of Temporal Resolution Using BSM1 on the Performance of Machine Learning

Journal of Korean Society of Environmental Engineers. Pub Date: 2023-12-31. DOI: 10.4491/ksee.2023.45.12.625

Wonki O, S. Ki, J. M. Triolo, Seung Gu Shin



Abstract

Objectives: This study aims to establish efficient strategies for data-driven operational management by examining how machine learning modeling outcomes and data characteristics vary with data acquisition intervals and methods.

Methods: BSM1 was used to simulate wastewater treatment facilities and to generate influent and effluent water quality data at 15-minute intervals. The generated data were reduced in volume by downsampling, and their characteristics were examined with resampling techniques, including upsampling by interpolation. The performance of 30 machine learning models built with the downsampled data was then compared.

Results and Discussion: As the data acquisition interval increased (i.e., as downsampling progressed), R² decreased and RMSE increased. When the mean value was used as the representative value, data accuracy was high and error was minimal. Using the maximum value as the representative value helped maintain data characteristics and reduce information loss. Simple interpolation methods did not improve data accuracy. Furthermore, with wider data acquisition intervals, the practical predictive performance of the machine learning models decreased, and performance declined sharply once the data became insufficient.

Conclusion: For models that need to detect changes rather than maximize accuracy, using the maximum value over a specific period proves effective. The measurement interval of the data emerges as a significant factor affecting the performance of machine learning models, and models developed under different measurement intervals often fail to demonstrate the expected performance. In this study, all stages of data preprocessing, classification, training, and validation were implemented in LabVIEW, confirming the potential for integrating data analysis processes into LabVIEW, a platform widely used in the fields of control and measurement.
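The contrast between mean-based and maximum-based downsampling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' LabVIEW implementation and it does not use BSM1 output: the signal below is synthetic (a diurnal sine with one short spike), all function names are hypothetical, and comparing the coarse series to the original by repeating each block value (a zero-order hold) is an assumption made here only so that RMSE and R² can be computed at the original 15-minute resolution.

```python
import math

def mean(block):
    """Arithmetic mean of a non-empty list."""
    return sum(block) / len(block)

def downsample(values, factor, agg):
    """Collapse each run of `factor` consecutive samples into one value via `agg`."""
    blocks = [values[i:i + factor] for i in range(0, len(values), factor)]
    return [agg(b) for b in blocks]

def hold_upsample(coarse, factor, length):
    """Repeat each coarse value `factor` times (zero-order hold), trimmed to `length`."""
    out = []
    for v in coarse:
        out.extend([v] * factor)
    return out[:length]

def rmse(actual, estimate):
    """Root-mean-square error between two equal-length series."""
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(actual, estimate)) / len(actual))

def r2(actual, estimate):
    """Coefficient of determination of `estimate` against `actual`."""
    m = mean(actual)
    ss_res = sum((a - e) ** 2 for a, e in zip(actual, estimate))
    ss_tot = sum((a - m) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in for a 15-minute series (assumption, not BSM1 data):
# two days at 96 samples per day, a diurnal sine plus one short spike.
series = [10.0 + 3.0 * math.sin(2 * math.pi * i / 96) for i in range(192)]
series[50] += 8.0

# factor 4 = hourly, factor 16 = every 4 hours
for factor in (4, 16):
    for name, agg in (("mean", mean), ("max", max)):
        coarse = downsample(series, factor, agg)
        rebuilt = hold_upsample(coarse, factor, len(series))
        print(factor, name,
              "RMSE=%.3f" % rmse(series, rebuilt),
              "R2=%.3f" % r2(series, rebuilt),
              "peak=%.3f" % max(coarse))
```

Under these assumptions, the mean representation gives the lower RMSE at each interval, while the maximum representation keeps the spike's peak value in the coarse series, which is the trade-off the abstract reports between accuracy and the ability to detect changes.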


