{"title":"Data Quality: Importance of the ‘before analysis’ domain (Theory of Sampling, TOS)","authors":"","doi":"10.1002/cem.70025","DOIUrl":"10.1002/cem.70025","url":null,"abstract":"<p>Data analysts/chemometricians are part of a scientific collegium covering three distinct domains: i) sampling – ii) analysis – iii) data modelling, which are collectively influencing ‘data quality’. There is much more to data quality than analytical uncertainty. There are many situations where <i>analysis</i> is to be made of heterogeneous materials/batches/lots/flowing streams, which need to be <i>sampled</i> appropriately before analysis, following an often long and complex pathway ‘from-lot-to-aliquot’. In most cases, sampling and sub-sampling will <i>dominate</i> the total Measurement Uncertainty budget (MU<sub>total</sub>). Left-out MU<sub>sampling</sub> contributions may easily overwhelm the Total Analytical Error (TAE) uncertainty by factors 5, 10, 25 or <i>higher</i> as a function of the specific heterogeneity characteristics of the materials and systems targeted, and of the sampling procedure used (grab vs. composite sampling). Focus is here on the consequences of unwittingly ignoring the uncertainties originating in these domains, which e.g. will influence adversely on bilinear component directions (reducing model <i>accuracy</i>) as well as RMSE estimates reflecting <i>precision</i> (analyte concentration prediction, classification, time series prediction) and along the way will also clear up an evergreen mistake: contrary to many beliefs, ‘more data’ will <span>not</span> automatically reduce the magnitude of an unsatisfactory performance RMSE. It is shown how the Theory of Sampling (TOS) is the only guarantor of representative sampling in the critical ‘before analysis’ domain. This article introduces the essential minimum TOS competence which must be mastered by stakeholders from all three domains. 
The conceptual elements in the TOS <i>system</i> can be visualised as a graphic overview:</p><p>Kim H. Esbensen has been professor at three universities (National Geological Survey of Denmark and Greenland (2010–2015), Aalborg University, Denmark (2001–2010), Telemark Institute of Technology, Norway (1990–2000)) and professeur associé, Université du Québec à Chicoutimi, before switching to a quest as an independent consultant in 2015. He is a member of several scientific societies and has published widely across several scientific fields. He is the author of a widely used textbook in Multivariate Data Analysis (chemometrics), and in 2020 published “Introduction to the Theory and Practice of Sampling”. He was chairman of the taskforce responsible for the world's first horizontal (matrix-independent) sampling standard, DS 3077:2024. Esbensen is the founding editor of “Sampling Science and Technology (SST)” (https://www.sst-magazine.info/issues/). He can be reached at his homepage https://kheconsult.com/</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 4","pages":""},"PeriodicalIF":2.1,"publicationDate":"2025-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70025","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143787231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Quality: Importance of the ‘Before Analysis’ Domain [Theory of Sampling (TOS)]","authors":"Kim H. Esbensen","doi":"10.1002/cem.70021","DOIUrl":"10.1002/cem.70021","url":null,"abstract":"<p>Data Quality: what is it, where does it originate, how does it influence data modelling, what can chemometricians do about it? The ‘before analysis’ domain is prone to sampling errors resulting in uncertainties influencing the quality of both analysis and data analysis/data modelling. Nonrepresentative sampling of heterogeneous materials, batches, lots and process streams ‘before analysis’ contribute significantly to the total measurement uncertainty, MU<sub>total</sub> = MU<sub>sampling</sub> + MU<sub>analysis</sub>. The total sampling error (TSE) can dominate over the total analytical error (TAE) by factors ranging 5, 10 or <i>higher</i>, depending on the <i>degree</i> of material heterogeneity encountered and the specific sampling procedure employed to produce the final analytical aliquot, which is the only material actually analysed. The analytical aliquot is the physical manifestation of transgressing the boundary <span>from</span> the before analysis (sampling) domain <span>to</span> the domain of analysis. It is only possible to guarantee representativity of the analytical aliquot, and thus of the analytical results with respect to the original target batch/lot/process stream, by invoking the necessary sampling domain competence stipulated by theory of sampling (TOS). Primary sampling is the most important stage in the full lot-to-analysis pathway, quantitatively dominating MU<sub>total</sub> (but subsequent subsampling stages can also be significant). If the sources of adverse sampling error effects have not been eliminated, the sampling process is <i>biased</i> and MU<sub>total</sub> will be unnecessarily inflated. TOS offers ways and means to deal actively with a potential sampling bias (which is fundamentally different from the analytical bias). 
Overlooking, or deliberately ignoring dealing appropriately with sampling effects constitutes a lack of due diligence, which has critical bearings on the QC/QA demands on both analysis and data analysis/modelling. This article presents all uncertainty contributions in the lot-to-analysis-to-data modelling pathway, which must be identified and managed, eliminated or maximally reduced, to be able to document a fully minimised MU<sub>total</sub>. Data analysts/chemometricians are part of a scientific collegium covering all three domains: sampling—analysis—data modelling, which are collectively responsible for ‘data quality’. This comprehensive scope has serious implications for the current PAT paradigm, the foundation of which turns out to need significant reform regarding a key process sampling aspect regardless of whether physical samples, or PAT sensor technology spectra, are extracted/acquired. This article introduces the essential minimum TOS competence that must be mastered by stakeholders from all three domains.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 4","pages":""},"PeriodicalIF":2.1,"publicationDate":"2025-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70021","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143787233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expandable Diffusion Map–Based Weighted k-Nearest Neighbor Technique for Multimode Batch Process Monitoring","authors":"Liwei Feng, Yifei Wu, Shaofeng Guo, Yu Xing, Yuan Li","doi":"10.1002/cem.70020","DOIUrl":"10.1002/cem.70020","url":null,"abstract":"<div><p>The diffusion map–based <i>k</i>-nearest neighbor (DM-kNN) rule faces two challenges in multimode batch process monitoring. Firstly, the DM method encounters difficulties in projecting new samples. Features must be repeatedly re-extracted from the training samples, resulting in a time-consuming process. Faulty samples may be merged into normal samples and modeled together, which does not meet the requirements for fault detection. Secondly, DM-kNN has poor monitoring performance for multimode processes with significant variance differences. This paper proposes a technique called the expandable DM–based weighted <i>k</i>-nearest neighbor (EDM-WkNN) to solve these two issues. The expandable DM constructs a local projection matrix to project new samples. The effect of mode variance differences is eliminated by introducing weighted distances into the statistic. We compare EDM-WkNN with classical fault detection methods through numerical examples and the fed-batch fermentation penicillin (FBFP) process. 
Our experiments confirm that the EDM-WkNN method effectively monitors faults in multimode batch processes.</p></div>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 4","pages":""},"PeriodicalIF":2.1,"publicationDate":"2025-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143778291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Smart Monitoring Solutions for Real-Time Water pH Regulation in Aquatic Ecotoxicology","authors":"Usman Ibrahim, Nasir Abbas, Muhammad Riaz, Tahir Mahmood","doi":"10.1002/cem.70024","DOIUrl":"10.1002/cem.70024","url":null,"abstract":"<div><p>This study designs a statistical process control tool that effectively detects small and moderate shifts in process parameters, to address challenges in quality monitoring. The proposed control chart employs advanced statistical detection techniques to enhance sensitivity while reducing false alarms, thus improving detection performance in various applications. This methodology is applied in a real-life context within an aquatic ecotoxicology laboratory, where daily monitoring of water pH levels is essential for safeguarding the health of sensitive aquatic organisms, such as mysids. The laboratory environment is meticulously controlled to simulate natural conditions, and our application of the proposed control chart ensures that any deviations from the optimal pH level are detected promptly, thereby maintaining water quality and supporting the reliability of experimental outcomes. The paper comprehensively evaluates the performance of the proposed control chart in both zero-state and steady-state conditions, offering valuable insights for practitioners in the field. We present empirical evidence demonstrating that the proposed control chart significantly outperforms traditional control charts, including Shewhart, CUSUM, and EWMA, particularly in detecting small to moderate shifts in water pH levels. 
Furthermore, we provide optimal parameter settings tailored for specific monitoring scenarios, enhancing the applicability of the proposed control chart for quality control in laboratory environments.</p></div>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 4","pages":""},"PeriodicalIF":2.1,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143770223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Being Aware of Data Leakage and Cross-Validation Scaling in Chemometric Model Validation","authors":"Péter Király, Gergely Tóth","doi":"10.1002/cem.70026","DOIUrl":"10.1002/cem.70026","url":null,"abstract":"<p>Chemometrics is one of the most elaborated data science fields. It has pioneered the use of novel machine learning methods for several decades and still does. The literature on chemometric modeling is enormous; there are many guidelines, software packages, and other descriptions of how to perform careful analysis. On the other hand, the literature is often contradictory and inconsistent. There are many studies where results on specific datasets are generalized without justification, and later the generalized idea is cited without the original limits. In some cases, differences in the nomenclature of methods cause misinterpretations. As in every field of science, there are also preferences for methods that rest on the standing of research groups rather than on a flexible, genuinely scientific approach to selecting among the possibilities. There is also some inconsistency between the practical approach of chemometrics and statistical theory, where often unrealistic assumptions and limits are studied.</p><p>The widely elaborated know-how of chemometrics brings some rigidity to the field. There are trends in data science to which chemometrics adapts slowly. An example is thinking exclusively within the bias-variance trade-off in model building [<span>1</span>] instead of using models in the double descent region for large datasets [<span>2-4</span>]. Another problematic question is data leakage. To this day, chemometric models are often built and validated on data sets that suffer from data leakage.</p><p>In our investigations, we met cases where the huge literature background created considerable inertia against correcting misinterpretations. 
In 2021, we found that leave-one-out and leave-many-out cross-validation (LMO-CV) parameters can be scaled to each other [<span>5</span>]. Furthermore, we showed that the two approaches have about the same uncertainty in multiple linear regression (MLR) calculations [<span>6</span>]. Therefore, the choice between these methods should be guided by computational practicality rather than by preconceptions. We received some formal and informal criticism for omitting the results of some well-cited studies.</p><p>In this article, we present some examples to encourage rethinking of some traditional solutions in chemometrics. We show calculations illustrating how data leakage arises in chemometric tasks. Our other calculations focus on the scaling law in order to rehabilitate leave-one-out cross-validation.</p><p>In machine learning, data leakage means the use, during model building, of information that biases the assessment of the model's predictive performance or that will not be available during real predictive application of the model. A typical and easy-to-detect example is when cases very similar to the training ones are present in the test set. A different form of leakage occurs when the explanatory variables include variables or classes that are too closely related to the response variables. Data leakage causes problems in model ","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 4","pages":""},"PeriodicalIF":2.1,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70026","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143749606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Green and Rapid Quantification of Ciprofloxacin Hydrochloride and Tylosin Tartrate in Veterinary Formulation using UV Spectrophotometric Method: A Comparative Study of Nature-Inspired Algorithms for Feature Selection","authors":"Mostafa M. Eraqi, Ayman M. Algohary, Youssef O. Al-Ghamdi, Ahmed M. Ibrahim","doi":"10.1002/cem.70023","DOIUrl":"10.1002/cem.70023","url":null,"abstract":"<div><p>Rapid and accurate quantification of ciprofloxacin hydrochloride (CIP) and tylosin tartrate (TYZ) in veterinary formulations is crucial for ensuring product quality and therapeutic efficacy. This study introduces a green and cost-effective analytical method that combines the simplicity of UV spectrophotometry with the optimization power of nature-inspired algorithms for the simultaneous determination of CIP and TYZ in a tablet veterinary formulation. Fourteen nature-inspired algorithms were comparatively assessed using root average squared error (RASE), average absolute error (AAE), and the coefficient of determination (<i>R</i><sup>2</sup>). The Corona virus optimization (CVO) algorithm and the Bat algorithm demonstrated superior performance for CIP and TYZ, respectively. The CVO algorithm, optimized for CIP, exhibited RASE, AAE, and <i>R</i><sup>2</sup> values of 0.37, 0.27, and 0.998, respectively, for the calibration set, while the Bat algorithm, tailored for TYZ, yielded RASE, AAE, and <i>R</i><sup>2</sup> values of 0.54, 0.41, and 0.984. Test sets yielded RASE, AAE, and <i>R</i><sup>2</sup> values of 0.55, 0.46, and 0.991 for CIP and 0.20, 0.15, and 0.995 for TYZ, respectively, confirming the algorithms' predictive ability. Validation was performed using the accuracy profile approach. The limits of detection (LODs) were determined to be 0.86 μg mL<sup>−1</sup> for CIP and 0.36 μg mL<sup>−1</sup> for TYZ, while the limits of quantification (LOQs) were calculated as 2.88 μg mL<sup>−1</sup> for CIP and 1.21 μg mL<sup>−1</sup> for TYZ. 
The method's environmental impact was comprehensively assessed using the Green Solvent Selection Tool (GSST), the National Environmental Methods Index (NEMI), a modified Eco-Scale, the Modified GAPI (MoGAPI), and a complementary whiteness evaluation via the RGBfast algorithm, confirming its eco-friendly profile. The proposed method demonstrated superior greenness, as reflected in its elevated GSST scores and favorable NEMI assessment. Specifically, the method achieved a modified Eco-Scale score of 84, a MoGAPI score of 81, and a whiteness index of 61, as determined by the RGBfast algorithm. These results confirm the method's environmentally sustainable profile, reinforcing its suitability for green analytical applications. This novel approach offers significant advantages in terms of cost, speed, and environmental sustainability compared to conventional chromatographic techniques, paving the way for more efficient and greener analytical methods in pharmaceutical quality control. Furthermore, this study highlights the innovative integration of UV spectroscopy with nature-inspired algorithms, demonstrating significant advancements over conventional UV methodologies for pharmaceutical analysis.</p></div>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 4","pages":""},"PeriodicalIF":2.1,"publicationDate":"2025-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143726769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Foreword for Special Issue Devoted to the 14th Winter Symposium on Chemometrics (2024)","authors":"Anastasiia Surkova, Dmitry Kirsanov","doi":"10.1002/cem.70022","DOIUrl":"10.1002/cem.70022","url":null,"abstract":"<p>The 14th Winter Symposium on Chemometrics (WSC14) was held in Tsaghkadzor (Armenia) from 26 February to 1 March 2024. The WSC is a biannual international meeting series started in Russia in 2002. Since that time, WSC has become an important event that is well known among other chemometric meetings for its friendly and relaxed atmosphere, rich social program and consistently high quality of scientific presentations. The scope of WSC meetings covers all relevant topics in modern chemometrics, both in theoretical developments and practical applications. In 2024, the conference was held under the auspices of the Armenian Academy of Sciences. Thirty-six participants from eight countries took part in the meeting, and the scientific program contained six lectures, 16 talks and 17 poster presentations. The invited lectures were delivered by Prof. Douglas N. Rutledge (France), Prof. Stefan Tsakovski (Bulgaria), Prof. Hadi Parastar (Iran) and Prof. Xihui Bian (China). Key lectures were presented by Dr. Alexey Pomerantsev and Dr. Oxana Rodionova. The variety of presentation topics included applications of near infrared spectrometry, hyperspectral imaging, QSPR, aquaphotomics, multiblock data analysis, machine learning, and deep learning.</p><p>The conference venue was located in a spectacular place near the Tsaghkadzor ski resort, and as part of the sports program the participants were able to enjoy skiing in the beautiful Armenian mountains. Traditional evening gatherings, so-called “scores and loadings,” were held every conference evening with guitar playing, singing and informal discussions on all possible topics, either highly scientific or deeply prosaic. 
The last day of the conference was devoted to the guided tours to Sevan Lake with ancient Sevanavank monastery and to Yerevan city—the capital of hospitable Armenia.</p><p>The WSC meetings are always very friendly to young scientists, offering Best young scientist award—this year the prize was the registration for CAC-2024 (Chemometrics in Analytical Chemistry) in Argentina. The respected jury of senior chemometricians decided to award Dr. Ekaterina Boichenko for her talk “Near-infrared spectroscopy and chemometrics: a promising combination for real-time and nondestructive classification of urinary stones.” Three best poster prizes were awarded to Anastasia Sholokhova, Dr. Maria Khaydukova, and Dr. Larisa Lvova. If the feedback from participants is to be believed, all in all it was an enjoyable event. The place and the time for WSC15 will be announced soon.</p><p>Organizing committee of the 14th WSC.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 4","pages":""},"PeriodicalIF":2.1,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70022","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143690125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Block Chemometric Approaches to the Unsupervised Spectral Characterization of Geological Samples","authors":"Beatriz Galindo-Prieto, Ian S. Mudway, Johan Linderholm, Paul Geladi","doi":"10.1002/cem.70010","DOIUrl":"10.1002/cem.70010","url":null,"abstract":"<p>As an example of the potential use of multi-block chemometric methods to provide improved unsupervised characterization of compositionally complex materials through the integration of multi-modal spectrometric data sets, we analysed spectral data derived from five field instruments (one XRF, two NIR, and two FT-Raman), collected on 76 bedrock samples of diverse composition. These data were analysed by single- and multi-block latent variable models, based on principal component analysis (PCA) and partial least squares (PLS). For the single-block approach, PCA and PLS models were generated, whilst hierarchical partial least squares (HPLS) regression was applied for the multi-block modelling. We also tested whether dimensionality reduction resulted in a more computationally efficient multi-block HPLS model with enhanced model interpretability and geological characterization power using the variable influence on projection (VIP) feature selection method.</p><p>The results showed differences in the characterization power of the five spectrometer data sets for the bedrock samples based on their mineral composition and geological properties; moreover, some spectroscopic techniques under-performed in distinguishing samples by composition. The multi-block HPLS and its VIP-strengthened model yielded a more complete unsupervised geological grouping of the samples in a single parsimonious model. We conclude that multi-block HPLS models are effective at combining multi-modal spectrometric data to provide a more comprehensive characterization of compositionally complex samples, and VIP can reduce HPLS model complexity while increasing its data interpretability. 
These approaches have been applied here to a geological data set, but are amenable to a broad range of applications across chemical and biomedical disciplines.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 3","pages":""},"PeriodicalIF":2.1,"publicationDate":"2025-03-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70010","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143632623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Partition-Based Cross-Validation With Centering and Scaling for $\\mathbf{X}^{\\mathbf{T}}\\mathbf{X}$ and $\\mathbf{X}^{\\mathbf{T}}\\mathbf{Y}$","authors":"Ole-Christian Galbo Engstrøm, Martin Holm Jensen","doi":"10.1002/cem.70008","DOIUrl":"10.1002/cem.70008","url":null,"abstract":"<p>We present algorithms that substantially accelerate partition-based cross-validation for machine learning models that require matrix products $\\mathbf{X}^{\\mathbf{T}}\\mathbf{X}$ and $\\mathbf{X}^{\\mathbf{T}}\\mathbf{Y}$. Our algorithms have applications in model selection for, for example, principal component analysis (PCA), principal component regression (PCR), ridge regression (RR), ordinary least squares (OLS), and partial least squares (PLS). 
Our algorithms support all combinations of column-wise centering and scaling of $\\mathbf{X}$ and $\\mathbf{Y}$, and we demonstrate in our accompanying implementation that this adds only a manageable, practical constant over efficient variants without preprocessing. We prove the correctness of our algorithms under a fold-based partitioning scheme and show that the running time is independent of the number of folds; that is, they have the same time complexity as that of computing $\\mathbf{X}^{\\mathbf{T}}\\mathbf{X}$ and ","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 3","pages":""},"PeriodicalIF":2.1,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70008","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143602456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}