Effect of Feature Scaling Pre-processing Techniques on Machine Learning Algorithms to Predict Particulate Matter Concentration for Gandhinagar, Gujarat, India

International Journal of Scientific Research in Science and Technology. Pub Date: 2024-02-01. DOI: 10.32628/ijsrst52411150

Zalak L. Thakker, Sanjay H. Buch



Abstract

Particulate matter (PM) is widely recognized as the primary factor responsible for air pollution, posing significant health hazards, particularly cardiovascular and respiratory diseases. Major sources of particulate matter include construction sites, power plants, industries and automobiles, landfills and agriculture, wildfires and brush/waste burning, wind-blown dust from open lands, pollen, and fragments of bacteria. Although various studies have attempted to predict particulate matter concentration, only a handful of papers focus on data scaling as a pre-processing step and how it affects the prediction. For this study, Gandhinagar Smart City Development Limited, Gandhinagar, Gujarat, provided air quality data covering 26-01-2022 to 16-01-2023. The data present several challenges: missing values, inconsistent values, and mixed types (numerical and categorical). Data pre-processing is an essential step in machine learning regression problems; its techniques include missing value handling, data scaling, outlier detection, feature selection/engineering, and imputation. This paper therefore aims to identify the effect of data scaling on the prediction of Particulate Matter (PM10) concentration for Gandhinagar, Gujarat. Data scaling is performed based on whether the data are normally distributed or not. Four data scaling techniques (Normalizer, Robust Scaler, Min-Max Scaler, and Standard Scaler) were compared in combination with six machine learning algorithms (Multiple Linear Regressor, Support Vector Regressor, K-Nearest Neighbour Regressor, Decision Tree Regressor, Random Forest Regressor, and XGBoost Regressor) to identify the best prediction model for PM10 concentration.
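The abstract names the four scaling techniques without stating their formulas. The sketch below shows what each one computes, assuming the definitions used by the scikit-learn classes of the same names (the abstract does not say which library the authors used). It is written in plain Python so it runs without dependencies; the function names and the PM10 readings are hypothetical and are not taken from the Gandhinagar dataset.

```python
import math
import statistics


def min_max_scale(column):
    """Map a feature column linearly onto the range [0, 1]."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(x - lo) / span if span else 0.0 for x in column]


def standard_scale(column):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std if std else 0.0 for x in column]


def robust_scale(column):
    """Centre on the median and divide by the interquartile range (Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1
    return [(x - median) / iqr if iqr else 0.0 for x in column]


def normalize_row(row):
    """Rescale one sample (a row, not a column) to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm if norm else 0.0 for x in row]


# Hypothetical PM10 readings (illustrative only) with one extreme value.
pm10 = [40.0, 50.0, 60.0, 70.0, 280.0]

print("min-max :", min_max_scale(pm10))
print("standard:", standard_scale(pm10))
print("robust  :", robust_scale(pm10))
print("row norm:", normalize_row([3.0, 4.0]))
```

With these illustrative values, min-max scaling compresses the four ordinary readings into [0, 0.125] because the single extreme value sets the upper bound, whereas robust scaling keeps them spread over [-1, 0.5] and maps the extreme value to 11. Distance-based and margin-based models such as K-Nearest Neighbour and Support Vector regressors are generally sensitive to this kind of difference, while tree-based models generally are not.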


