{"title":"Being Aware of Data Leakage and Cross-Validation Scaling in Chemometric Model Validation","authors":"Péter Király, Gergely Tóth","doi":"10.1002/cem.70026","DOIUrl":"https://doi.org/10.1002/cem.70026","url":null,"abstract":"<p>Chemometrics is one of the most elaborate fields of data science. For several decades it has been, and still is, a pioneer in the use of novel machine learning methods. The literature on chemometric modeling is enormous; there are many guidelines, software packages, and other descriptions of how to perform a careful analysis. On the other hand, the literature is often contradictory and inconsistent. In many studies, results on specific datasets are generalized without justification, and later the generalized idea is cited without the original limits. In some cases, differences in the nomenclature of methods cause misinterpretations. As in every field of science, there are also preferences among methods that rest on the standing of research groups rather than on a flexible and genuinely scientific selection among the possibilities. There is also some inconsistency between the practical approach of chemometrics and theoretical statistics, where unrealistic assumptions and limits are often studied.</p><p>The extensively developed know-how of chemometrics brings some rigidity to the field. Chemometrics adapts slowly to some trends in data science. An example is thinking exclusively in terms of the bias-variance trade-off in model building [<span>1</span>] instead of using models in the double descent region for large datasets [<span>2-4</span>]. Another problematic question is data leakage. To this day, chemometric models are built, and often validated, on datasets that suffer from data leakage.</p><p>In our investigations, we met cases where the huge body of literature created considerable inertia against the correction of misinterpretations. 
In 2021, we found that leave-one-out and leave-many-out cross-validation (LMO-CV) parameters can be scaled to each other [<span>5</span>]. Furthermore, we showed that the two approaches have about the same uncertainty in multiple linear regression (MLR) calculations [<span>6</span>]. Therefore, the choice between these methods should be guided by computational practicality rather than by preconceptions. We received some formal and informal criticism for omitting the results of some well-cited studies.</p><p>In this article, we present some examples to encourage rethinking of some traditional solutions in chemometrics. Some of our calculations show how data leakage arises in chemometric tasks. Our other calculations focus on the scaling law in order to rehabilitate leave-one-out cross-validation.</p><p>In machine learning, data leakage means the use of information during model building that biases the assessment of the model's predictions or that will not be available during real predictive application of the model. A typical and easy-to-detect example is when cases very similar to the training ones are present in the test set. A different form of leakage occurs when the explanatory variables include variables or classes that are too closely related to the response variables. Data leakage causes problems in model ","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 4","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70026","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143749606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Green and Rapid Quantification of Ciprofloxacin Hydrochloride and Tylosin Tartrate in Veterinary Formulation using UV Spectrophotometric Method: A Comparative Study of Nature-Inspired Algorithms for Feature Selection","authors":"Mostafa M. Eraqi, Ayman M. Algohary, Youssef O. Al-Ghamdi, Ahmed M. Ibrahim","doi":"10.1002/cem.70023","DOIUrl":"https://doi.org/10.1002/cem.70023","url":null,"abstract":"<div><p>Rapid and accurate quantification of ciprofloxacin hydrochloride (CIP) and tylosin tartrate (TYZ) in veterinary formulations is crucial for ensuring product quality and therapeutic efficacy. This study introduces a green and cost-effective analytical method that combines the simplicity of UV spectrophotometry with the optimization power of nature-inspired algorithms for the simultaneous determination of CIP and TYZ in a veterinary tablet formulation. Fourteen nature-inspired algorithms were comparatively assessed using root average squared error (RASE), average absolute error (AAE), and the coefficient of determination (<i>R</i><sup>2</sup>). The Corona virus optimization (CVO) algorithm and the Bat algorithm demonstrated superior performance for CIP and TYZ, respectively. The CVO algorithm, optimized for CIP, exhibited RASE, AAE, and <i>R</i><sup>2</sup> values of 0.37, 0.27, and 0.998, respectively, for the calibration set, while the Bat algorithm, tailored for TYZ, yielded RASE, AAE, and <i>R</i><sup>2</sup> values of 0.54, 0.41, and 0.984. Test sets yielded RASE, AAE, and <i>R</i><sup>2</sup> values of 0.55, 0.46, and 0.991 for CIP and 0.20, 0.15, and 0.995 for TYZ, respectively, confirming the algorithms' predictive ability. Validation was performed using the accuracy profile approach. The limits of detection (LODs) were determined to be 0.86 μg mL<sup>−1</sup> for CIP and 0.36 μg mL<sup>−1</sup> for TYZ, while the limits of quantification (LOQs) were calculated as 2.88 μg mL<sup>−1</sup> for CIP and 1.21 μg mL<sup>−1</sup> for TYZ. 
The method's environmental impact was comprehensively assessed using the Green Solvent Selection Tool (GSST), the National Environmental Methods Index (NEMI), a modified Eco-Scale, the Modified GAPI (MoGAPI), and a complementary whiteness evaluation via the RGBfast algorithm, confirming its eco-friendly profile. The proposed method demonstrated superior greenness, as reflected in its elevated GSST scores and favorable NEMI assessment. Specifically, the method achieved a modified Eco-Scale score of 84, a MoGAPI score of 81, and a whiteness index of 61, as determined by the RGBfast algorithm. These results confirm the method's environmentally sustainable profile, reinforcing its suitability for green analytical applications. This novel approach offers significant advantages in terms of cost, speed, and environmental sustainability compared to conventional chromatographic techniques, paving the way for more efficient and greener analytical methods in pharmaceutical quality control. Furthermore, this study highlights the innovative integration of UV spectroscopy with nature-inspired algorithms, demonstrating significant advancements over conventional UV methodologies for pharmaceutical analysis.</p></div>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 4","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143726769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Foreword for Special Issue Devoted to the 14th Winter Symposium on Chemometrics (2024)","authors":"Anastasiia Surkova, Dmitry Kirsanov","doi":"10.1002/cem.70022","DOIUrl":"https://doi.org/10.1002/cem.70022","url":null,"abstract":"<p>The 14th Winter Symposium on Chemometrics (WSC14) was held in Tsaghkadzor (Armenia) from 26 February to 1 March 2024. The WSC is a biannual international meeting series that started in Russia in 2002. Since then, the WSC has become an important event, well known among chemometric meetings for its friendly and relaxed atmosphere, rich social program, and consistently high quality of scientific presentations. The scope of WSC meetings covers all relevant topics in modern chemometrics, both in theoretical developments and practical applications. In 2024, the conference was held under the auspices of the Armenian Academy of Sciences. Thirty-six participants from eight countries took part in the meeting, and the scientific program contained six lectures, 16 talks and 17 poster presentations. The invited lectures were delivered by Prof. Douglas N. Rutledge (France), Prof. Stefan Tsakovski (Bulgaria), Prof. Hadi Parastar (Iran) and Prof. Xihui Bian (China). Key lectures were presented by Dr. Alexey Pomerantsev and Dr. Oxana Rodionova. The variety of presentation topics included applications of near infrared spectrometry, hyperspectral imaging, QSPR, aquaphotomics, multiblock data analysis, machine learning, and deep learning.</p><p>The conference venue was located in a spectacular place near the Tsaghkadzor ski resort, and as part of the sports program, the participants were able to enjoy skiing in the beautiful Armenian mountains. The traditional evening gatherings, the so-called “scores and loadings,” were held every conference evening with guitar playing, singing, and informal discussions on all possible topics, either highly scientific or deeply prosaic. 
The last day of the conference was devoted to guided tours to Lake Sevan, with the ancient Sevanavank monastery, and to Yerevan, the capital of hospitable Armenia.</p><p>The WSC meetings are always very friendly to young scientists, offering a Best Young Scientist Award; this year the prize was the registration for CAC-2024 (Chemometrics in Analytical Chemistry) in Argentina. The respected jury of senior chemometricians decided to give the award to Dr. Ekaterina Boichenko for her talk “Near-infrared spectroscopy and chemometrics: a promising combination for real-time and nondestructive classification of urinary stones.” Three best poster prizes were awarded to Anastasia Sholokhova, Dr. Maria Khaydukova, and Dr. Larisa Lvova. If the feedback from participants is to be believed, all in all it was an enjoyable event. The place and time of WSC15 will be announced soon.</p><p>Organizing committee of the 14th WSC.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 4","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70022","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143690125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Block Chemometric Approaches to the Unsupervised Spectral Characterization of Geological Samples","authors":"Beatriz Galindo-Prieto, Ian S. Mudway, Johan Linderholm, Paul Geladi","doi":"10.1002/cem.70010","DOIUrl":"https://doi.org/10.1002/cem.70010","url":null,"abstract":"<p>As an example of the potential use of multi-block chemometric methods to provide improved unsupervised characterization of compositionally complex materials through the integration of multi-modal spectrometric data sets, we analysed spectral data derived from five field instruments (one XRF, two NIR, and two FT-Raman), collected on 76 bedrock samples of diverse composition. These data were analysed by single- and multi-block latent variable models, based on principal component analysis (PCA) and partial least squares (PLS). For the single-block approach, PCA and PLS models were generated, whilst hierarchical partial least squares (HPLS) regression was applied for the multi-block modelling. We also tested whether dimensionality reduction resulted in a more computationally efficient multi-block HPLS model with enhanced model interpretability and geological characterization power using the variable influence on projection (VIP) feature selection method.</p><p>The results showed differences in the characterization power of the five spectrometer data sets for the bedrock samples based on their mineral composition and geological properties; moreover, some spectroscopic techniques under-performed for distinguishing samples by composition. The multi-block HPLS and its VIP-strengthened model yielded a more complete unsupervised geological grouping of the samples in a single parsimonious model. We conclude that multi-block HPLS models are effective at combining multi-modal spectrometric data to provide a more comprehensive characterization of compositionally complex samples, and VIP can reduce HPLS model complexity, while increasing its data interpretability. 
These approaches have been applied here to a geological data set, but are amenable to a broad range of applications across chemical and biomedical disciplines.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 3","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70010","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143632623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Partition-Based Cross-Validation With Centering and Scaling for $\\mathbf{X}^{\\mathbf{T}}\\mathbf{X}$ and $\\mathbf{X}^{\\mathbf{T}}\\mathbf{Y}$","authors":"Ole-Christian Galbo Engstrøm, Martin Holm Jensen","doi":"10.1002/cem.70008","DOIUrl":"https://doi.org/10.1002/cem.70008","url":null,"abstract":"<p>We present algorithms that substantially accelerate partition-based cross-validation for machine learning models that require matrix products $\\mathbf{X}^{\\mathbf{T}}\\mathbf{X}$ and $\\mathbf{X}^{\\mathbf{T}}\\mathbf{Y}$. Our algorithms have applications in model selection for, for example, principal component analysis (PCA), principal component regression (PCR), ridge regression (RR), ordinary least squares (OLS), and partial least squares (PLS). 
Our algorithms support all combinations of column-wise centering and scaling of $\\mathbf{X}$ and $\\mathbf{Y}$, and we demonstrate in our accompanying implementation that this adds only a manageable, practical constant over efficient variants without preprocessing. We prove the correctness of our algorithms under a fold-based partitioning scheme and show that the running time is independent of the number of folds; that is, they have the same time complexity as that of computing $\\mathbf{X}^{\\mathbf{T}}\\mathbf{X}$ and ","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 3","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70008","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143602456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}