Collinear datasets augmentation using Procrustes validation sets
Sergey Kucheryavskiy, Sergei Zhilin
Analytica Chimica Acta, volume 1351, Article 343913. Published 2025-03-15. DOI: 10.1016/j.aca.2025.343913
https://www.sciencedirect.com/science/article/pii/S0003267025003071
Citation count: 0
Abstract
Background:
High-complexity models, such as artificial neural networks (ANNs), require large training datasets to avoid overfitting and reproducibility issues. However, experimental datasets, especially those involving spectroscopic or other highly collinear data, are often small because of practical constraints. Currently available data augmentation methods either do not handle collinearity well or require resource-intensive training. Thus, there is a pressing need for an efficient, scalable method for augmenting collinear datasets to enhance model performance in both regression and classification tasks.
Results:
We propose a novel, efficient data augmentation method tailored for datasets with moderate to high collinearity, particularly spectroscopic data. The method combines latent variable modeling with cross-validation resampling to generate new data points. The approach has been validated using various datasets; here we report detailed results for two case studies: fat content prediction in minced meat and discrimination of olives based on near-infrared spectra. In both cases, artificial neural networks were employed, and augmentation resulted in significant improvements in model performance for both prediction and classification. Specifically, for fat content prediction, the method reduced the root mean squared error on the independent test set by up to a factor of three.
Significance:
The proposed method provides a fast and simple solution for augmenting collinear datasets, significantly improving model performance without requiring extensive parameter tuning. It is versatile and can be applied to a range of datasets, offering a practical alternative to more complex augmentation techniques.
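The abstract describes the method only as latent variable modeling combined with cross-validation resampling. The sketch below illustrates one way these two ingredients could be combined; it is an assumption-laden illustration, not the published algorithm. It assumes PCA as the latent variable model, random cross-validation segments, and an orthogonal Procrustes rotation to align each local model with the global one. The names `pca_loadings` and `augment_sketch` and all parameter defaults are hypothetical.

```python
import numpy as np

def pca_loadings(X, n_comp):
    """Return PCA loadings (n_vars x n_comp) of an already centered matrix."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n_comp].T

def augment_sketch(X, n_comp=2, n_segments=4, seed=0):
    """Generate one artificial copy of X (hypothetical sketch, see lead-in).

    For each cross-validation segment, a local PCA model is fitted on the
    remaining rows. The left-out rows are projected onto the local model,
    the local scores are rotated toward the global model, and the rows are
    rebuilt from the global loadings plus their local residuals.
    """
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean                      # assumption: global centering only
    P = pca_loadings(Xc, n_comp)       # global loadings
    X_new = np.empty_like(Xc)
    for seg in np.array_split(rng.permutation(len(X)), n_segments):
        keep = np.ones(len(X), dtype=bool)
        keep[seg] = False
        P_k = pca_loadings(Xc[keep], n_comp)   # local loadings
        # Orthogonal Procrustes: rotation R minimizing ||P_k R - P||_F
        U, _, Vt = np.linalg.svd(P_k.T @ P)
        R = U @ Vt
        X_k = Xc[seg]
        T_k = X_k @ P_k                # local scores of left-out rows
        E_k = X_k - T_k @ P_k.T        # local residuals of left-out rows
        X_new[seg] = (T_k @ R) @ P.T + E_k
    return X_new + mean
```

Calling `augment_sketch` several times with different `seed` values gives different segment splits and therefore slightly different artificial datasets, which could then be stacked with the original rows before training a model. Because each new row is built from the latent structure of the original data, the collinearity between variables is largely retained.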
About the journal:
Analytica Chimica Acta has an open access mirror journal Analytica Chimica Acta: X, sharing the same aims and scope, editorial team, submission system and rigorous peer review.
Analytica Chimica Acta provides a forum for the rapid publication of original research, and critical, comprehensive reviews dealing with all aspects of fundamental and applied modern analytical chemistry. The journal welcomes the submission of research papers which report studies concerning the development of new and significant analytical methodologies. In determining the suitability of submitted articles for publication, particular scrutiny will be placed on the degree of novelty and impact of the research and the extent to which it adds to the existing body of knowledge in analytical chemistry.