Anuradha Yenkikar, Ved Prakash Mishra, Manish Bali, Tabassum Ara
Explainable forecasting of air quality index using a hybrid random forest and ARIMA model
MethodsX, volume 15, Article 103517. Published 2025-07-18. DOI: 10.1016/j.mex.2025.103517
https://www.sciencedirect.com/science/article/pii/S2215016125003619
Abstract
Accurate and interpretable prediction of the Air Quality Index (AQI) is critical for public health decision-making and environmental policy enforcement. This study presents a hybrid forecasting framework that combines Random Forest Regression (RFR) and Autoregressive Integrated Moving Average (ARIMA) models to improve AQI prediction accuracy while maintaining model transparency. The RFR captures nonlinear relationships among pollutants, and ARIMA models the temporal patterns remaining in the RFR residuals, forming a two-stage learning architecture. The model is trained and evaluated on multi-year AQI data from India and validated with an expanding window cross-validation strategy that preserves temporal order. For transparency and interpretability, the study employs SHAP (SHapley Additive exPlanations) to uncover the influence of key pollutants such as PM₂.₅, NO₂, and SO₂. Additionally, Ljung-Box diagnostics and uncertainty bands are used to validate model adequacy. Compared to baseline models, the hybrid approach achieves a lower Mean Squared Error (MSE = 508.46) and a higher R² score (0.94), confirming improved generalization. This research contributes a replicable, explainable, and efficient AQI forecasting framework suited for deployment in resource-constrained urban environments. The method comprises:
Residual learning hybrid model: Random Forest for prediction + ARIMA for residual correction
Time-aware validation using expanding window cross-validation
Model interpretability through SHAP analysis
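The residual-learning and expanding-window steps listed above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The function names (`expanding_window_splits`, `fit_ar1`, `hybrid_forecast`) and the parameters `initial_train` and `horizon` are hypothetical. Stage-one predictions are passed in as plain lists standing in for Random Forest output, and a no-intercept AR(1) fit stands in for the ARIMA model on the residuals.

```python
from typing import Iterator, List, Sequence, Tuple


def expanding_window_splits(
    n_samples: int, initial_train: int, horizon: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for expanding-window validation.

    The training window always starts at index 0 and grows by `horizon`
    samples per fold, so every test block lies strictly after its training data.
    """
    if initial_train < 1 or horizon < 1:
        raise ValueError("initial_train and horizon must be positive")
    train_end = initial_train
    while train_end + horizon <= n_samples:
        yield list(range(train_end)), list(range(train_end, train_end + horizon))
        train_end += horizon


def fit_ar1(residuals: Sequence[float]) -> float:
    """Least-squares AR(1) coefficient (no intercept) of a residual series.

    Assumption: AR(1) is only a stand-in for the ARIMA stage described above.
    """
    num = sum(residuals[t] * residuals[t - 1] for t in range(1, len(residuals)))
    den = sum(r * r for r in residuals[:-1])
    return num / den if den else 0.0


def hybrid_forecast(
    y_train: Sequence[float],
    stage1_train: Sequence[float],
    stage1_test: Sequence[float],
) -> List[float]:
    """Two-stage forecast: stage-one prediction plus forecast residual.

    `stage1_train` and `stage1_test` are the first-stage model's predictions
    (a stand-in for Random Forest output) on the training and test periods.
    """
    if not y_train or len(y_train) != len(stage1_train):
        raise ValueError("y_train and stage1_train must be non-empty and aligned")
    # Stage 2 input: what the first-stage model failed to explain.
    residuals = [y - p for y, p in zip(y_train, stage1_train)]
    phi = fit_ar1(residuals)
    corrected = []
    last = residuals[-1]
    for pred in stage1_test:
        last = phi * last  # iterate the AR(1) recursion one step ahead
        corrected.append(pred + last)
    return corrected


if __name__ == "__main__":
    # Show how the training window grows across folds.
    for train_idx, test_idx in expanding_window_splits(10, 4, 2):
        print(f"train size={len(train_idx)}, test={test_idx}")
```

In a fuller version, each fold would refit both stages on its own training indices before scoring the test block. With standard libraries, scikit-learn's `RandomForestRegressor` and `TimeSeriesSplit` and statsmodels' `ARIMA` would be the natural replacements for the stand-ins used here.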