{"title":"PLS multi-step regressions in data paths","authors":"Agnar Höskuldsson","doi":"10.1016/j.chemolab.2024.105167","DOIUrl":"https://doi.org/10.1016/j.chemolab.2024.105167","url":null,"abstract":"<div><p>A procedure is presented that extends standard PLS regression to several data matrices arranged in a path. The basic idea is to convert the path of data matrices into interconnected regressions. PLS forecasts are extended to multi-step forecasts for each data matrix in the path. We study how far forecasts can be made, i.e., how far we can ‘see’ along the path. It is shown how data paths are divided into parts such that multi-step forecasting can be carried out within each part. The principles of PLS are used to suggest estimation criteria for the regressions. These methods can be used to supervise a complex path of industrial chemical/biological processes. It is shown how expanding and contracting paths, which are common in industrial processes, can be handled. The methods can also be used to analyze general path models. An example briefly shows how a Structural Equations Model (SEM) can be converted into a collection of sequential paths that can be analyzed by the present methods. The results suggest that conclusions drawn from SEM analysis may not always be reliable. The theory is applied to process data. 
It is shown how each regression can be analyzed in much the same way as in PLS.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"251 ","pages":"Article 105167"},"PeriodicalIF":3.7,"publicationDate":"2024-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141435166","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diana C. Fechner , Ramón A. Martinez , Melisa J. Hidalgo , Adriano Araújo Gomes , Roberto G. Pellerano , Héctor C. Goicoechea
{"title":"Geographic authentication of argentinian teas by combining one-class models and discriminant methods for modeling near infrared spectra","authors":"Diana C. Fechner , Ramón A. Martinez , Melisa J. Hidalgo , Adriano Araújo Gomes , Roberto G. Pellerano , Héctor C. Goicoechea","doi":"10.1016/j.chemolab.2024.105156","DOIUrl":"https://doi.org/10.1016/j.chemolab.2024.105156","url":null,"abstract":"<div><p>In this study, 110 tea samples from South American countries (Argentina, Brazil, and Paraguay) and Asian countries (India and China) were analyzed using near-infrared spectroscopy (NIRS) together with a two-step chemometric authentication strategy (class modeling techniques and discriminant analysis) to authenticate commercial teas from Argentina. In the first step, one-class models were built and validated to authenticate South American teas using preprocessed NIRS data. For this purpose, data-driven soft independent modeling of class analogy (DD-SIMCA) and one-class partial least squares (OC-PLS) were used. The DD-SIMCA model gave the best results, with a sensitivity of 93.10%, specificity of 100%, and efficiency of 95.00%. In the second step, a support vector machine (SVM) was used to build and validate a multiclass model to discriminate between tea samples from Argentina and neighboring countries of South America. The best model used nine variables selected by the fast correlation-based filter (FCBF) method and reached an accuracy of 98.30%. 
Therefore, we conclude that the combination of NIRS and two-step chemometric tools can be used to authenticate the geographical origin of samples with high inter-class similarity.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"251 ","pages":"Article 105156"},"PeriodicalIF":3.9,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141314560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gabriely S. Folli , Márcia H.C. Nascimento , Betina P.O. Lovatti , Wanderson Romão , Paulo R. Filgueiras
{"title":"A generation of synthetic samples and artificial outliers via principal component analysis and evaluation of predictive capability in binary classification models","authors":"Gabriely S. Folli , Márcia H.C. Nascimento , Betina P.O. Lovatti , Wanderson Romão , Paulo R. Filgueiras","doi":"10.1016/j.chemolab.2024.105154","DOIUrl":"10.1016/j.chemolab.2024.105154","url":null,"abstract":"<div><p>Unbalanced sample groups tend to yield models with a higher prevalence of predominant classes. A sample group with balanced classes contributes to the development of more robust models with improved predictive capability to classify classes equally. In the literature, two methodologies for sample balancing can be found: elimination (undersampling) and synthetic sample generation (oversampling). Undersampling methodologies result in the loss of real samples, while oversampling methods may introduce issues related to adding non-real signals to the original spectra. To overcome these challenges, this paper aimed to utilize Principal Component Analysis (PCA) for the generation of virtual samples (synthetic samples and artificial outliers) to balance data in multivariate classification models. The proposed methodology was applied to data from mid-infrared spectroscopy (MIR) and high-resolution mass spectrometry (HRMS) with Partial Least Squares Discriminant Analysis (PLS-DA) and Support Vector Machine (SVM) models. The constructed models demonstrate that the addition of virtual samples enhances performance parameters (e.g., false negative rate, false positive rate, accuracy, sensitivity, specificity, among others) compared to unbalanced models, while also mitigating overfitting (a problem found in unbalanced models). Performance parameters exhibited a more significant improvement percentage using the non-linear model (SVM) compared to the linear model (PLS-DA). 
Furthermore, the created virtual spectra do not introduce new signals, i.e., original and virtual spectra exhibit a similar spectral profile, differing only in the intensity levels. Finally, all models demonstrated good predictive capability according to permutation testing for the binary model developed in this work, limiting the rate of class permutation retention (between 40 % and 60 % of the y-vector remained in the original class). All created models exhibited accuracy values higher than the accuracy distribution of models with permuted classes for the test group.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"251 ","pages":"Article 105154"},"PeriodicalIF":3.9,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141232858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adrián Gómez-Sánchez , Raffaele Vitale , Cyril Ruckebusch , Anna de Juan
{"title":"Solving the missing value problem in PCA by Orthogonalized-Alternating Least Squares (O-ALS)","authors":"Adrián Gómez-Sánchez , Raffaele Vitale , Cyril Ruckebusch , Anna de Juan","doi":"10.1016/j.chemolab.2024.105153","DOIUrl":"https://doi.org/10.1016/j.chemolab.2024.105153","url":null,"abstract":"<div><p>Dealing with missing data poses a challenge in Principal Component Analysis (PCA) since the most common algorithms are not designed to handle them. Several approaches have been proposed to solve the missing value problem in PCA, such as Imputation based on SVD (I-SVD), where missing entries are filled by imputation and updated in every iteration until convergence of the PCA model, and the adaptation of the Nonlinear Iterative Partial Least Squares (NIPALS) algorithm, able to work skipping the missing entries during the least-squares estimation of scores and loadings. However, some limitations have been reported for both approaches. On the one hand, convergence of the I-SVD algorithm can be very slow for data sets with a high percentage of missing data. On the other hand, the orthogonality properties among scores and loadings might be lost when using NIPALS.</p><p>To solve these issues and perform PCA of data sets with missing values without the need of imputation steps, a novel algorithm called Orthogonalized-Alternating Least Squares (O-ALS) is proposed. The O-ALS algorithm is an alternating least-squares algorithm that estimates the scores and loadings subject to the Gram-Schmidt orthogonalization constraint. The way to estimate scores and loadings is adapted to work only with the available information.</p><p>In this study, the performance of O-ALS is tested and compared with NIPALS and I-SVD in simulated data sets and in a real case study. 
The results show that O-ALS is an accurate and fast algorithm to analyze data with any percentage and distribution pattern of missing entries, being able to provide correct scores and loadings in cases where I-SVD and NIPALS do not perform satisfactorily.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"250 ","pages":"Article 105153"},"PeriodicalIF":3.9,"publicationDate":"2024-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169743924000935/pdfft?md5=86dff5a658083570086161657efcf7cb&pid=1-s2.0-S0169743924000935-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141241037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E. Aguado-Sarrió , J.M. Prats-Montalbán , J. Camps-Herrero , A. Ferrer
{"title":"Virtual biopsies for breast cancer using MCR-ALS perfusion-based biomarkers and double cross-validation PLS-DA","authors":"E. Aguado-Sarrió , J.M. Prats-Montalbán , J. Camps-Herrero , A. Ferrer","doi":"10.1016/j.chemolab.2024.105152","DOIUrl":"https://doi.org/10.1016/j.chemolab.2024.105152","url":null,"abstract":"<div><p>Functional MRI is, currently, the most sensitive technique in breast cancer for detecting early tumors, and perfusion (DCE-MRI) has become the most important sequence to depict and characterize angiogenesis and neovascularization. In this work, we propose the use of new biomarkers that are related to clear physiological phenomena, obtained from MCR-ALS as an alternative to curve-based pseudo-biomarkers and pharmacokinetics models. In order to provide a discrimination and prediction model between healthy tissue and cancer, we propose using PLS-DA with double cross-validation (2CV) and variable selection, repeated several times and obtaining excellent average results for the performance indexes (f-score: 0.9149, MCC: 0.8538, AUROC: 0.8794). 
After selecting the optimal prediction model, a unique probabilistic map called “virtual biopsy” that shows in different colors the probability that each pixel of the image has a tumor behavior is obtained, helping the specialist with the identification and characterization of breast tumors with only one easy-to-interpret biomarker map.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"250 ","pages":"Article 105152"},"PeriodicalIF":3.9,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169743924000923/pdfft?md5=a737eedaef6f346a82a2ab4bc9a92f6d&pid=1-s2.0-S0169743924000923-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141241036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A long sequence NOx emission prediction model for rotary kilns based on transformer","authors":"Youlin Guo, Zhizhong Mao","doi":"10.1016/j.chemolab.2024.105151","DOIUrl":"10.1016/j.chemolab.2024.105151","url":null,"abstract":"<div><p>Time-series prediction, especially over long sequences, is of great practical value in industrial scenarios such as rotary kilns. Accurate long sequence NOx emission predictions help us monitor rotary kiln operations in advance to plan and control NOx emissions according to emission policies and production requirements. However, in actual industrial scenarios, the NOx emission pattern is dominated by long-term trends rather than simply repetitive patterns. Existing NOx prediction models are not effective in capturing long-term dependencies. Therefore, this paper proposes a novel model based on the Transformer to solve this problem. First, we propose a novel series decomposition architecture based on LSTM and self-attention, which is embedded inside the Transformer. The architecture allows self-attention at the sub-series level and provides short-term trend and position information. In addition, the model uses a one-step inference structure that mitigates the error accumulation of traditional inference methods in long sequence prediction and reduces the inference time. We conducted extensive experiments on two real-world datasets with different sampling intervals, which validated the model’s effectiveness. 
It achieves a relative improvement of 53.2% and 43.4% in prediction accuracy compared to popular NOx emission prediction methods.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"251 ","pages":"Article 105151"},"PeriodicalIF":3.9,"publicationDate":"2024-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141145469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Riccardo Voccio , Cristina Malegori , Paolo Oliveri , Federica Branduani , Marco Arimondi , Andrea Bernardi , Giorgio Luciano , Mattia Cettolin
{"title":"Combining PLS-DA and SIMCA on NIR data for classifying raw materials for tyre industry: A hierarchical classification model","authors":"Riccardo Voccio , Cristina Malegori , Paolo Oliveri , Federica Branduani , Marco Arimondi , Andrea Bernardi , Giorgio Luciano , Mattia Cettolin","doi":"10.1016/j.chemolab.2024.105150","DOIUrl":"https://doi.org/10.1016/j.chemolab.2024.105150","url":null,"abstract":"<div><p>Tyre materials are complex products, as they are prepared using a number of raw materials, each of them with its specific chemical composition and functionality in the final product. It is, therefore, of crucial importance to avoid mislabeling errors and even to verify the compliance of raw materials entering the factory.</p><p>The present study proposes a strategy that makes use of near infrared (NIR) spectroscopy combined with chemometrics for raw material identification (RMID) and compliance verification of the most common raw materials used in the tyre industry. In particular, the chemometric model developed consists of a global hierarchical classification model, which combines nested PLS-DA nodes for RMID and SIMCA nodes for compliance verification, in a two-step approach.</p><p>The global model showed satisfactory results: 100 % of total correct predictions and a sensitivity higher than 90 % in the test set were obtained for most of the classes of interest.</p><p>The strategy is ultimately intended to be applied directly to the raw materials at their receiving stage in the factory, with the double advantage of minimizing the risk of mislabeling and, at the same time, decreasing the number of suspicious samples that need to be analyzed in the laboratory by traditional methods to verify their compliance.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"250 ","pages":"Article 105150"},"PeriodicalIF":3.9,"publicationDate":"2024-05-19","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S016974392400090X/pdfft?md5=c98998e0122d4f4f2c21e7b0a46c05e0&pid=1-s2.0-S016974392400090X-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141090372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An alternative for the robust assessment of the repeatability and reproducibility of analytical measurements using bivariate dispersion","authors":"Elfried Salanon , Blandine Comte , Delphine Centeno , Stéphanie Durand , Estelle Pujos-Guillot , Julien Boccard","doi":"10.1016/j.chemolab.2024.105148","DOIUrl":"10.1016/j.chemolab.2024.105148","url":null,"abstract":"<div><h3>Introduction</h3><p>Assessing repeatability and reproducibility in analytical chemistry is commonly based on parametric dispersion indicators, such as relative standard deviation and standard deviation, calculated for each detected variable using repeated measurements of Quality Control (QC) samples collected throughout the data acquisition sequence. However, their reliability strongly relies on the assumption of a normal distribution. Because analytical variability depends on many sources, the use of such parametric estimators is not always suitable. There is therefore a need for robust indicators of data quality independent of central values and any parametric assumption.</p></div><div><h3>Methods</h3><p>Three specific indicators were developed: (i) intra-group dispersion, based on the median area of the convex hull of QC samples within an analytical batch; (ii) inter-group dispersion, defined as the gradient of the deviation between analytical batches; and (iii) dispersion index. Mathematical properties of these indicators, including positivity, stability, and translation invariance, were then evaluated using synthetic data under normal and non-normal distributions. 
Finally, the relevance of these indicators and the associated visualization methods was highlighted based on a metabolomics case study involving liquid chromatography coupled to mass spectrometry measurements of the NIST SRM1950 reference material analyzed over more than one year within different projects.</p></div><div><h3>Results</h3><p>The proposed indicators were shown to be translation invariant and always positive, while first investigations performed on synthetic data revealed high stability under multiplication. Moreover, their application to experimental data revealed specific behaviors depending on the characteristics of the signal associated with the different detected analytes, showing their ability to capture the variability observed either in parametric or non-parametric conditions. In addition, this investigation showed different structures of sensitivity to analytical variability across the data processing steps. The proposed indicators also allowed a visualization of the analytical drift in two dimensions, to facilitate result interpretation.</p></div><div><h3>Conclusion</h3><p>These indicators open the way not only to a better and more robust assessment of repeatability and reproducibility but also to improvements in long-term data comparability involving suitability testing.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"250 ","pages":"Article 105148"},"PeriodicalIF":3.9,"publicationDate":"2024-05-18","publicationTypes":"Journal 
Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169743924000881/pdfft?md5=12d877a2bc93c6070b76e59f9583bbfc&pid=1-s2.0-S0169743924000881-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141135489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving golden jackel optimization algorithm: An application of chemical data classification","authors":"Aiedh Mrisi Alharthi , Dler Hussein Kadir , Abdo Mohammed Al-Fakih , Zakariya Yahya Algamal , Niam Abdulmunim Al-Thanoon , Maimoonah Khalid Qasim","doi":"10.1016/j.chemolab.2024.105149","DOIUrl":"10.1016/j.chemolab.2024.105149","url":null,"abstract":"<div><p>One of the main issues affecting the effectiveness of the quantitative structure-activity relationship (QSAR) classification techniques in chemometrics is high dimensionality. Applying feature selection is a critical procedure that determines the most relevant and important aspects of a dataset. It improves the effectiveness and accuracy of prediction models by effectively lowering the number of features. This decrease increases classification accuracy, reduces computing strain, and improves overall performance. Recently, the golden jackal optimization (GJO) algorithm was introduced, which has been successfully used to solve various continuous optimization issues. Therefore, this study proposes an improvement in the GJO algorithm employing chaotic maps, abbreviated as CGJO, to enhance the exploration and exploitation capability of the GJO algorithm in picking the essential descriptors in QSAR classification models with high classification accuracy and less computation time. Experimental findings based on four different high-dimensional chemical datasets show that the proposed CGJO algorithm can maximize classification accuracy while simultaneously decreasing the number of chosen descriptors and lowering the time required for computing. 
Thus, the proposed algorithm can be useful for chemical data classification in other QSAR modeling tasks.</p></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"250 ","pages":"Article 105149"},"PeriodicalIF":3.9,"publicationDate":"2024-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141034199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}