Collinear datasets augmentation using Procrustes validation sets

IF 5.7 Zone 2 (Chemistry) Q1 CHEMISTRY, ANALYTICAL

Analytica Chimica Acta Pub Date : 2025-03-15 DOI:10.1016/j.aca.2025.343913

Sergey Kucheryavskiy , Sergei Zhilin

Analytica Chimica Acta, volume 1351, Article 343913. Journal Article. https://www.sciencedirect.com/science/article/pii/S0003267025003071

Citations: 0

Abstract

Background:

High-complexity models, such as artificial neural networks (ANN), require large training datasets to avoid overfitting and reproducibility issues. However, experimental datasets, especially those involving spectroscopic or other highly collinear data, are often limited in size due to practical constraints. Currently available data augmentation methods either do not handle collinearity well or require resource-intensive training. Thus, there is a pressing need for an efficient, scalable method for augmenting collinear datasets to enhance model performance in both regression and classification tasks.

Results:

We propose a novel, efficient data augmentation method tailored for datasets with moderate to high collinearity, particularly spectroscopic data. The method combines latent variable modeling with cross-validation resampling to generate new data points. The approach has been validated using various datasets; here we report detailed results for two case studies: fat content prediction in minced meat and discrimination of olives based on near-infrared spectra. In both cases, artificial neural networks were employed, and augmentation resulted in significant improvements in model performance for both prediction and classification. Specifically, for fat content prediction, the method reduced the root mean squared error by up to 3-fold on the independent test set.
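The combination of latent variable modeling and cross-validation resampling described above can be illustrated with a simplified sketch. The code below is not the authors' implementation. It assumes PCA as the latent variable model, random cross-validation segments, and an orthogonal Procrustes rotation that aligns each segment's local loadings with the global ones. The function names (`pca_loadings`, `augment_sketch`), the synthetic data, and all parameter values are hypothetical.

```python
import numpy as np

def pca_loadings(X_centered, n_components):
    """Return PCA loadings (n_vars x n_components) of already centered data."""
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return Vt[:n_components].T

def augment_sketch(X, n_components=2, n_segments=4, seed=0):
    """Generate one new row per original row (illustrative assumption only)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mean = X.mean(axis=0)
    P = pca_loadings(X - mean, n_components)       # global loadings
    idx = rng.permutation(n)
    X_new = np.empty_like(X - mean)
    for seg in np.array_split(idx, n_segments):
        train = np.setdiff1d(idx, seg)             # rows outside this segment
        m_k = X[train].mean(axis=0)
        P_k = pca_loadings(X[train] - m_k, n_components)  # local loadings
        # Orthogonal Procrustes: rotation R minimizing ||P_k R - P||_F
        U, _, Vt = np.linalg.svd(P_k.T @ P)
        R = U @ Vt
        X_seg = X[seg] - m_k
        T = X_seg @ P_k                            # scores in the local model
        E = X_seg - T @ P_k.T                      # residuals of the local model
        E_glob = E - (E @ P) @ P.T                 # keep part orthogonal to global loadings
        X_new[seg] = T @ R @ P.T + E_glob          # re-express in the global model
    return X_new + mean

# Synthetic collinear data: 40 samples, 50 variables, 2 latent components
rng = np.random.default_rng(1)
scores = rng.normal(size=(40, 2))
loadings = rng.normal(size=(2, 50))
X = scores @ loadings + 0.05 * rng.normal(size=(40, 50))

X_aug = augment_sketch(X, n_components=2, n_segments=4)
X_train = np.vstack([X, X_aug])                    # original plus generated rows
```

Repeating the call with different seeds or segment counts would yield further generated sets. For a regression task the generated rows would also need matching response values, which this sketch does not address.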

Significance:

The proposed method provides a fast and simple solution for augmenting collinear datasets, significantly improving model performance without requiring extensive parameter tuning. It is versatile and can be applied to a range of datasets, offering a practical alternative to more complex augmentation techniques.

Source journal

Analytica Chimica Acta (Chemistry: Analytical Chemistry)

CiteScore: 10.40

Self-citation rate: 6.50%

Articles published: 1081

Review time: 38 days

About the journal: Analytica Chimica Acta has an open access mirror journal Analytica Chimica Acta: X, sharing the same aims and scope, editorial team, submission system and rigorous peer review. Analytica Chimica Acta provides a forum for the rapid publication of original research, and critical, comprehensive reviews dealing with all aspects of fundamental and applied modern analytical chemistry. The journal welcomes the submission of research papers which report studies concerning the development of new and significant analytical methodologies. In determining the suitability of submitted articles for publication, particular scrutiny will be placed on the degree of novelty and impact of the research and the extent to which it adds to the existing body of knowledge in analytical chemistry.