Collinear datasets augmentation using Procrustes validation sets

Impact factor 5.7; Zone 2, Chemistry; JCR Q1, Chemistry, Analytical
Sergey Kucheryavskiy, Sergei Zhilin
Analytica Chimica Acta, volume 1351, Article 343913. Published 2025-03-15. DOI: 10.1016/j.aca.2025.343913
https://www.sciencedirect.com/science/article/pii/S0003267025003071
Citations: 0

Abstract

Background:

High-complexity models, such as artificial neural networks (ANN), require large training datasets to avoid overfitting and reproducibility issues. However, experimental datasets, especially those involving spectroscopic or other highly collinear data, are often small because of practical constraints. Currently available data augmentation methods either do not handle collinearity well or require resource-intensive training. Thus, there is a pressing need for an efficient, scalable method for augmenting collinear datasets to enhance model performance in both regression and classification tasks.

Results:

We propose a novel, efficient data augmentation method tailored for datasets with moderate to high collinearity, particularly spectroscopic data. The method combines latent variable modeling with cross-validation resampling to generate new data points. The approach has been validated on various datasets; here we report detailed results for two case studies: fat content prediction in minced meat and discrimination of olives based on near-infrared spectra. In both cases, artificial neural networks were employed, and augmentation significantly improved model performance in prediction and classification. Specifically, for fat content prediction, the method reduced the root mean squared error on the independent test set by up to 3-fold.
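
The abstract does not spell out the algorithm, so the following is only a rough sketch of how latent variable modeling and cross-validation resampling might be combined to generate new data points. It is not the authors' implementation. Every choice in it is an assumption made for illustration: PCA as the latent variable model, a systematic split into segments, centering with the global mean, an orthogonal Procrustes rotation of local loadings onto global ones, and the hypothetical names `pca_loadings` and `augment_collinear`. The synthetic "spectra" are invented and carry no relation to the meat or olive datasets.

```python
import numpy as np


def pca_loadings(X, n_comp):
    """Return the first n_comp PCA loadings (as columns) of a centered matrix."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:n_comp].T


def augment_collinear(X, n_comp=3, n_segments=5):
    """Build one surrogate copy of X by cross-validation resampling (simplified)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                              # assumption: global centering only
    P = pca_loadings(Xc, n_comp)               # global loadings, shape (p, A)
    X_new = np.empty_like(Xc)
    rows = np.arange(X.shape[0])

    for k in range(n_segments):
        val = rows[k::n_segments]              # systematic split: every n-th row
        cal = np.setdiff1d(rows, val)          # remaining rows form the local set
        Pk = pca_loadings(Xc[cal], n_comp)     # local loadings without segment k

        # Orthogonal Procrustes: rotation R minimising ||Pk @ R - P||_F
        U, _, Vt = np.linalg.svd(Pk.T @ P)
        R = U @ Vt

        Tk = Xc[val] @ Pk                      # scores of held-out rows, local model
        Ek = Xc[val] - Tk @ Pk.T               # residuals of the local model
        explained = (Tk @ R) @ P.T             # local scores carried to global space
        orthogonal = Ek - (Ek @ P) @ P.T       # residual part outside global space
        X_new[val] = explained + orthogonal

    return X_new + mean


# Synthetic collinear data: 30 "spectra" built from 3 smooth profiles plus noise.
rng = np.random.default_rng(0)
w = np.linspace(0.0, 1.0, 50)
profiles = np.vstack([np.exp(-((w - c) ** 2) / 0.02) for c in (0.25, 0.5, 0.75)])
X = rng.normal(size=(30, 3)) @ profiles + 0.01 * rng.normal(size=(30, 50))

X_aug = augment_collinear(X, n_comp=3, n_segments=5)
combined = np.vstack([X, X_aug])               # original plus surrogate rows
print(combined.shape)
```

Each surrogate row stays close to its original but is not identical to it, because its scores come from a model that never saw that row. In this sketch, changing the number of segments or the way rows are assigned to them changes the surrogate set, so repeating the procedure with different splits is one way to grow the training data further.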

Significance:

The proposed method provides a fast and simple solution for augmenting collinear datasets, significantly improving model performance without requiring extensive parameter tuning. It is versatile and can be applied to a range of datasets, offering a practical alternative to more complex augmentation techniques.

Source journal
Analytica Chimica Acta (Chemistry, Analytical Chemistry)
CiteScore: 10.40
Self-citation rate: 6.50%
Articles published: 1081
Review time: 38 days
About the journal: Analytica Chimica Acta has an open access mirror journal, Analytica Chimica Acta: X, sharing the same aims and scope, editorial team, submission system and rigorous peer review. Analytica Chimica Acta provides a forum for the rapid publication of original research, and critical, comprehensive reviews dealing with all aspects of fundamental and applied modern analytical chemistry. The journal welcomes the submission of research papers which report studies concerning the development of new and significant analytical methodologies. In determining the suitability of submitted articles for publication, particular scrutiny will be placed on the degree of novelty and impact of the research and the extent to which it adds to the existing body of knowledge in analytical chemistry.