Monte Carlo peaks: Simulated datasets to benchmark machine learning algorithms for clinical spectroscopy

IF 3.8 | Zone 2 (Chemistry) | JCR Q2: AUTOMATION & CONTROL SYSTEMS

Chemometrics and Intelligent Laboratory Systems Pub Date : 2025-10-08 DOI:10.1016/j.chemolab.2025.105548

Jaume Béjar-Grimalt , Ángel Sánchez-Illana , Guillermo Quintás , Hugh J. Byrne , David Pérez-Guaita

Volume 267, Article 105548. Full text: https://www.sciencedirect.com/science/article/pii/S0169743925002333

Citations: 0

Abstract


Monte Carlo peaks: Simulated datasets to benchmark machine learning algorithms for clinical spectroscopy

Infrared and Raman spectroscopy hold great promise for clinical applications. However, the inherent complexity of the associated spectral data necessitates the use of advanced machine learning techniques which, while powerful in extracting biological information, often operate as black-box models. Combined with the absence of standardized datasets, this hinders model optimization, interpretability, and the systematic benchmarking of the growing number of newly developed machine learning methods. To address this, we propose a simulation-based framework that uses Monte Carlo approaches to generate fully synthetic spectral datasets for benchmarking. The artificial datasets mimic a wide range of realistic scenarios, including overlapping spectral markers and non-discriminant features, and can be adjusted to simulate the effect of different parameters, such as instrumental noise, number of interferences, and sample size. These spectra are simulated through the generation of Lorentzian bands across the mid-infrared range, without specific reference to experimental data or chemical structures. We used the proposed methodology to compare different spectral marker identification protocols in a partial least squares discriminant analysis (PLS-DA), showing that the orthogonal PLS-DA (OPLS-DA) approach, when combined with marker selection based on VIP scores or the regression vector, yielded higher sensitivity, specificity, and interpretability than standard PLS-DA using the same selection criteria. This framework was further used to benchmark the classification capabilities of commonly employed machine learning algorithms, incorporating both linear and non-linear markers reflective of compositional variations across the target classes. Key findings were validated using real infrared spectra from human blood serum and saliva collected as part of a clinical study. Overall, the proposed approach provides a versatile sandbox environment for the systematic evaluation of data analysis strategies in vibrational spectroscopy, one that can help experimentalists better interpret spectral markers and support data analysts in benchmarking and validating new algorithms.
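
To make the simulation idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It draws random Lorentzian bands, raises the height of a few "marker" bands in one class, and adds Gaussian noise. The function names (`lorentzian`, `simulate_dataset`), the 900 to 1800 cm^-1 window, the band widths, the effect size, and the noise level are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import random


def lorentzian(x, center, width, height):
    """Lorentzian band: peak `height` at `center`, half-width at half-maximum `width`."""
    return height * width ** 2 / ((x - center) ** 2 + width ** 2)


def simulate_dataset(n_per_class=20, n_bands=15, n_markers=3,
                     effect=0.3, noise_sd=0.01, seed=0):
    """Simulate a two-class set of synthetic spectra (all defaults are hypothetical)."""
    rng = random.Random(seed)
    # Assumed wavenumber axis in cm^-1; a real study would pick its own range.
    wavenumbers = list(range(900, 1801, 4))
    # Band positions and widths are drawn once and shared by every sample.
    bands = [(rng.uniform(900, 1800), rng.uniform(5, 25)) for _ in range(n_bands)]
    # A random subset of bands acts as discriminant markers; the rest are interferences.
    marker_idx = set(rng.sample(range(n_bands), n_markers))

    spectra, labels = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            spectrum = [0.0] * len(wavenumbers)
            for b, (center, width) in enumerate(bands):
                # Sample-to-sample variation in band height.
                height = rng.gauss(1.0, 0.1)
                # Marker bands are stronger in class 1 (a linear marker).
                if label == 1 and b in marker_idx:
                    height += effect
                height = max(height, 0.0)
                for i, x in enumerate(wavenumbers):
                    spectrum[i] += lorentzian(x, center, width, height)
            # Additive Gaussian noise stands in for instrumental noise.
            spectrum = [v + rng.gauss(0.0, noise_sd) for v in spectrum]
            spectra.append(spectrum)
            labels.append(label)
    return wavenumbers, spectra, labels, sorted(marker_idx)


if __name__ == "__main__":
    wn, X, y, markers = simulate_dataset()
    print(len(X), "spectra,", len(wn), "points each; marker bands:", markers)
```

Because the marker bands are known by construction, a marker selection method (for example VIP scores from PLS-DA) can be scored against ground truth, which is the benchmarking use the abstract describes. Changing `noise_sd`, `n_bands`, or `n_per_class` mimics the effect of instrumental noise, number of interferences, and sample size.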


Source journal: Chemometrics and Intelligent Laboratory Systems (Engineering & Technology: Analytical Chemistry)

CiteScore: 7.50

Self-citation rate: 7.70%

Articles published: 169

Review time: 3.4 months

About the journal: Chemometrics and Intelligent Laboratory Systems publishes original research papers, short communications, reviews, tutorials and Original Software Publications reporting on the development of novel statistical, mathematical, or computer techniques in Chemistry and related disciplines. Chemometrics is the chemical discipline that uses mathematical and statistical methods to design or select optimal procedures and experiments, and to provide maximum chemical information by analysing chemical data. The journal deals with the following topics:
1) Development of new statistical, mathematical and chemometrical methods for Chemistry and related fields (Environmental Chemistry, Biochemistry, Toxicology, System Biology, -Omics, etc.).
2) Novel applications of chemometrics to all branches of Chemistry and related fields (typical domains of interest are: process data analysis, experimental design, data mining, signal processing, supervised modelling, decision making, robust statistics, mixture analysis, multivariate calibration, etc.). Routine applications of established chemometrical techniques will not be considered.
3) Development of new software that provides novel tools or truly advances the use of chemometrical methods.
4) Well characterized data sets to test performance for the new methods and software.
The journal complies with the International Committee of Medical Journal Editors' Uniform Requirements for Manuscripts.