Data-driven identification of pollution sources and water quality prediction using Apriori and LSTM models: A case study in the Hanjiang River basin
Mingyang Liu, Jiake Li, Yafang Li, Weijie Gao, Jingkun Lu
Journal of Contaminant Hydrology, volume 272, Article 104570, published 2025-04-11
DOI: 10.1016/j.jconhyd.2025.104570
https://www.sciencedirect.com/science/article/pii/S0169772225000750
Citations: 0
Abstract
Rapid urbanization and industrialization have exacerbated surface water pollution, especially from point sources such as industrial discharge and urban wastewater, posing a severe challenge to global environmental health and sustainable development. This study combines the Apriori algorithm and Long Short-Term Memory (LSTM) networks to identify major pollution sources and predict dynamic changes in water quality. The study area encompasses four national hydrological monitoring stations in the core area of the South-to-North Water Diversion Project; the multi-source data collected include water quality parameters and industry-specific discharge data. Using the Apriori algorithm, the pollutants with the highest support, namely chemical oxygen demand (COD), copper (Cu), suspended solids (SS), and zinc (Zn), showed a support value of 0.87, indicating that the metallurgical, electroplating, and chemical industries are the primary pollution sources. Further association rule analysis under varying parameter thresholds revealed that when COD is present, the co-occurrence confidence for cadmium (Cd), Cu, lead (Pb), and SS reaches 0.9, and the combination of COD, Cu, Pb, SS, and cyanide (CN) achieves a confidence of 1, indicating a high degree of correlation among these pollutants. The LSTM model predicted water quality accurately: Root Mean Square Error (RMSE) values for COD predictions at the hydrological stations ranged from 0.2076 to 0.3366, and the coefficients of determination (R²) all exceeded 0.9, highlighting the model's stability and predictive accuracy. This study provides a scientific basis for the sustainable management of watershed water resources and serves as a reference for environmental policymaking and water resource protection.
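The support, confidence, RMSE, and R² figures quoted in the abstract follow standard definitions, which the sketch below illustrates in plain Python. It is not the authors' implementation: the function names, the toy pollutant samples, and the toy COD series are all hypothetical and chosen only to make the arithmetic easy to check.

```python
from math import sqrt


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent:
    support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical samples: each set lists the pollutants detected in one sample.
# These are NOT data from the study.
samples = [
    {"COD", "Cu", "SS", "Zn"},
    {"COD", "Cu", "SS"},
    {"COD", "Pb"},
    {"Cu", "Zn"},
]

# Hypothetical observed and predicted COD values (arbitrary units).
cod_observed = [10.0, 12.0, 14.0, 16.0]
cod_predicted = [10.5, 11.5, 14.5, 15.5]

print("support({COD, Cu})    =", support(samples, {"COD", "Cu"}))
print("confidence(COD -> Cu) =", confidence(samples, {"COD"}, {"Cu"}))
print("RMSE                  =", rmse(cod_observed, cod_predicted))
print("R^2                   =", r2(cod_observed, cod_predicted))
```

Read this way, a support of 0.87 means the itemset appears in 87% of the records, and a confidence of 1 means that every record containing the rule's antecedent also contains its consequent.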
About the journal:
The Journal of Contaminant Hydrology is an international journal publishing scientific articles pertaining to the contamination of subsurface water resources. Emphasis is placed on investigations of the physical, chemical, and biological processes influencing the behavior and fate of organic and inorganic contaminants in the unsaturated (vadose) and saturated (groundwater) zones, as well as at groundwater-surface water interfaces. The ecological impacts of contaminants transported both from and to aquifers are of interest. Articles on contamination of surface water only, without a link to groundwater, are out of scope. Broad latitude is allowed in identifying contaminants of interest, which include legacy and emerging pollutants, nutrients, nanoparticles, pathogenic microorganisms (e.g., bacteria, viruses, protozoa), microplastics, and various constituents associated with energy production (e.g., methane, carbon dioxide, hydrogen sulfide).
The journal's scope embraces a wide range of topics including: experimental investigations of contaminant sorption, diffusion, transformation, volatilization and transport in the surface and subsurface; characterization of soil and aquifer properties only as they influence contaminant behavior; development and testing of mathematical models of contaminant behavior; innovative techniques for restoration of contaminated sites; development of new tools or techniques for monitoring the extent of soil and groundwater contamination; transformation of contaminants in the hyporheic zone; effects of contaminants traversing the hyporheic zone on surface water and groundwater ecosystems; subsurface carbon sequestration and/or turnover; and migration of fluids associated with energy production into groundwater.