Stacking density estimation and its oversampling method for continuously imbalanced data in chemometrics

IF 3.7 · Zone 2 (Chemistry) · JCR Q2 (AUTOMATION & CONTROL SYSTEMS)

Chemometrics and Intelligent Laboratory Systems · Pub Date: 2025-03-12 · DOI: 10.1016/j.chemolab.2025.105366

Xin-Ru Zhao, Lun-Zhao Yi, Guang-Hui Fu

Volume 261, Article 105366. Journal Article, not open access. Publisher page: https://www.sciencedirect.com/science/article/pii/S0169743925000516

Citations: 0

Abstract


Stacking density estimation and its oversampling method for continuously imbalanced data in chemometrics

Continuously imbalanced data are data whose target variable is continuous and unevenly distributed. Such data are widespread in practical applications, yet methods that handle them effectively remain relatively scarce, and imbalanced regression methods are urgently needed. First, we propose a Stacking-based density estimation (SDE) method to estimate the density of a continuously imbalanced target variable. SDE links density estimation with the ensemble learning algorithm Stacking; its core concept is the “fusion of multiple perspectives for accurate capture”. SDE enhances the model’s understanding of complex data structures and makes it more sensitive and accurate in identifying rare values. Next, we investigate an SDE-based oversampling technique (SDE-OS), which uses SDE to synthesize new instances in the rare-value region, allowing fine-grained control over the rare values that are added. In a series of numerical experiments, SDE estimates densities more accurately than kernel density estimation as measured by ANLL, and SDE-OS outperforms conventional sampling methods such as SMOGN and SMOTER on various metrics. The proposed SDE and SDE-OS are therefore highly competitive and effective tools for the imbalanced regression problem.
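The abstract does not specify how SDE is built, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that the base estimators are Gaussian kernel density estimators with different bandwidths, that the Stacking step learns convex combination weights from out-of-fold densities (here with EM), and that ANLL stands for average negative log-likelihood. The function names, the bandwidth values, and the inverse-density sampling step standing in for the rare-value selection of an oversampler are all hypothetical.

```python
import numpy as np


def gaussian_kde_pdf(train, query, bandwidth):
    """Density at `query` of a Gaussian KDE fitted on the 1-D sample `train`."""
    z = (query[:, None] - train[None, :]) / bandwidth
    norm = len(train) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z**2).sum(axis=1) / norm


def fit_stacking_weights(y, bandwidths, n_folds=5, n_iter=200, seed=0):
    """Learn convex weights over the base KDEs from out-of-fold densities."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    oof = np.empty((len(y), len(bandwidths)))
    for held_out in folds:
        train = np.delete(y, held_out)  # fit on the other folds only
        for j, h in enumerate(bandwidths):
            oof[held_out, j] = gaussian_kde_pdf(train, y[held_out], h)
    oof = np.maximum(oof, 1e-300)  # guard against log(0) and 0/0
    w = np.full(len(bandwidths), 1.0 / len(bandwidths))
    for _ in range(n_iter):
        # EM for mixture weights with fixed components:
        # responsibilities, then their column means.
        resp = oof * w
        resp /= resp.sum(axis=1, keepdims=True)
        w = resp.mean(axis=0)
    return w


def stacked_pdf(train, query, bandwidths, weights):
    """Weighted combination of the base KDE densities."""
    base = np.column_stack(
        [gaussian_kde_pdf(train, query, h) for h in bandwidths]
    )
    return base @ weights


def anll(density):
    """Average negative log-likelihood of held-out points (lower is better)."""
    return -np.mean(np.log(np.maximum(density, 1e-300)))


def rare_sampling_probs(density):
    """Hypothetical oversampling step: favour points with low estimated density."""
    inv = 1.0 / np.maximum(density, 1e-300)
    return inv / inv.sum()


rng = np.random.default_rng(42)
y = rng.lognormal(mean=0.0, sigma=0.6, size=400)  # right-skewed target, large values are rare
y_train, y_test = y[:300], y[300:]
bandwidths = [0.05, 0.2, 0.8]  # hypothetical base-estimator bandwidths
weights = fit_stacking_weights(y_train, bandwidths)

for h in bandwidths:
    print(f"KDE h={h}: ANLL = {anll(gaussian_kde_pdf(y_train, y_test, h)):.3f}")
stacked_test = stacked_pdf(y_train, y_test, bandwidths, weights)
print(f"stacked: ANLL = {anll(stacked_test):.3f}")

# Indices of training targets to oversample, drawn preferentially from rare regions.
train_density = stacked_pdf(y_train, y_train, bandwidths, weights)
probs = rare_sampling_probs(train_density)
picked = rng.choice(len(y_train), size=50, p=probs)
```

Fitting the weights on out-of-fold densities matters in this sketch: in-sample likelihood would always favour the narrowest bandwidth. A full oversampler for regression would also have to generate feature vectors for the selected targets, for example by interpolation as in SMOTER, which is omitted here.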


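Source journal

Chemometrics and Intelligent Laboratory Systems (Engineering & Technology: Analytical Chemistry)

CiteScore: 7.50

Self-citation rate: 7.70%

Number of articles published: 169

Review time: 3.4 months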

About the journal: Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on the development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines. Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data. The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, Systems Biology, -Omics, etc.).
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration, etc.). Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well-characterized data sets to test the performance of new methods and software.
The journal complies with the International Committee of Medical Journal Editors' Uniform Requirements for Manuscripts.