An automated preprocessing framework for near infrared spectroscopic data
Xiaojing Chen, Zhonghao Xie, Roma Tauler, Yong He, Pengcheng Nie, Yankun Peng, Liang Shu, Shujat Ali, Guangzao Huang, Wen Shi, Xi Chen, Leiming Yuan
Chemometrics and Intelligent Laboratory Systems, volume 267, Article 105542. Published 2025-09-28. DOI: 10.1016/j.chemolab.2025.105542
https://www.sciencedirect.com/science/article/pii/S0169743925002278
Abstract
Preprocessing plays a vital role in the analysis of near-infrared spectroscopy (NIRS) data because it removes unintended artifacts. It involves a series of steps, each targeting a particular artifact. However, because NIRS applications are so diverse, selecting the optimal combination of preprocessing methods remains a challenge. To address this issue, we propose an automated preprocessing framework that can quickly identify the optimal preprocessing strategy. The framework first constructs a workflow consisting of multiple types of preprocessing methods. A genetic algorithm (GA) then searches for the best pipeline, avoiding an exhaustive search. In addition, we add a penalty to the loss function of the GA to obtain a parsimonious solution. Results on three real-world datasets demonstrate that our approach outperforms several state-of-the-art ensemble preprocessing methods in terms of prediction error. Compared to the raw data, the optimal preprocessing method can improve model performance by at least 48%. Furthermore, our framework identifies the most effective preprocessing methods included in the best pipeline. The source code for our approach is available on GitHub and can be easily integrated with other existing preprocessing techniques.
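The search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the stage names, the candidate methods (`snv`, `msc`, `savgol`, and so on), the penalty weight `lam`, and the `toy_error` function are all assumptions made for illustration. In practice `error_fn` would be something like a cross-validated prediction error of a calibration model fitted on the preprocessed spectra.

```python
import random

# Hypothetical workflow: one gene per stage; index 0 always means "skip this stage".
STAGES = {
    "baseline": ["none", "detrend", "asls"],
    "scatter": ["none", "snv", "msc"],
    "smoothing": ["none", "savgol"],
    "scaling": ["none", "mean_center", "autoscale"],
}
STAGE_NAMES = list(STAGES)


def penalized_loss(chrom, error_fn, lam):
    """Prediction error plus a penalty proportional to the number of active steps."""
    n_active = sum(1 for gene in chrom if gene != 0)
    return error_fn(chrom) + lam * n_active


def crossover(a, b, rng):
    """One-point crossover between two parent pipelines."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(chrom, rng, rate):
    """Resample each gene with probability `rate`."""
    return tuple(
        rng.randrange(len(STAGES[s])) if rng.random() < rate else g
        for s, g in zip(STAGE_NAMES, chrom)
    )


def tournament(pop, scores, rng, k=3):
    """Pick the best of k randomly chosen individuals (lower score is better)."""
    contenders = rng.sample(range(len(pop)), k)
    return pop[min(contenders, key=lambda i: scores[i])]


def run_ga(error_fn, lam=0.05, pop_size=20, generations=30, rate=0.2, seed=0):
    rng = random.Random(seed)
    raw = tuple(0 for _ in STAGE_NAMES)  # the "no preprocessing" pipeline
    pop = [raw] + [
        tuple(rng.randrange(len(STAGES[s])) for s in STAGE_NAMES)
        for _ in range(pop_size - 1)
    ]
    best = min(pop, key=lambda c: penalized_loss(c, error_fn, lam))
    for _ in range(generations):
        scores = [penalized_loss(c, error_fn, lam) for c in pop]
        children = [best]  # elitism: the best pipeline so far always survives
        while len(children) < pop_size:
            child = crossover(
                tournament(pop, scores, rng), tournament(pop, scores, rng), rng
            )
            children.append(mutate(child, rng, rate))
        pop = children
        best = min(pop, key=lambda c: penalized_loss(c, error_fn, lam))
    return best, penalized_loss(best, error_fn, lam)


def decode(chrom):
    """Translate a chromosome into the list of active method names."""
    return [STAGES[s][g] for s, g in zip(STAGE_NAMES, chrom) if g != 0]


def toy_error(chrom):
    """Placeholder error surface, NOT real data: stands in for a CV error."""
    _, scatter, smoothing, scaling = chrom
    err = 1.0
    if scatter == 1:
        err -= 0.4
    if scaling == 1:
        err -= 0.3
    if smoothing == 1:
        err -= 0.01  # gain smaller than the penalty, so parsimony should drop it
    return err


if __name__ == "__main__":
    best, loss = run_ga(toy_error)
    print(decode(best), round(loss, 3))
```

The penalty term is what drives parsimony in this sketch: a step whose error reduction is smaller than `lam` makes the penalized loss worse, so the search tends to leave it out. Seeding the population with the raw (all-skip) pipeline and keeping the elite each generation means the returned pipeline is never worse, under the penalized loss, than applying no preprocessing.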
About the journal:
Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines.
Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data.
The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, Systems Biology, -Omics, etc.)
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration etc.) Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well characterized data sets to test performance for the new methods and software.
The journal complies with the International Committee of Medical Journal Editors' Uniform requirements for manuscripts.