{"title":"Origin of the OECD Principles for QSAR Validation and Their Role in Changing the QSAR Paradigm Worldwide: An Historical Overview","authors":"Paola Gramatica","doi":"10.1002/cem.70014","DOIUrl":"https://doi.org/10.1002/cem.70014","url":null,"abstract":"<p>The discussions in the QSAR community and the steps that led to the definition of the OECD Principles for the validation of QSAR models are illustrated here, framing the process in the general framework of QSAR modeling. The individual OECD Principles are presented, commenting on them in the light of significant publications that have appeared over the years, with particular attention to the aspects of statistical validation according to the chemometric approach. It will be highlighted how and to what extent the OECD Principles have influenced the subsequent work of all QSAR modelers and have led to a significant improvement in validated QSAR modeling applicable in the regulatory field and beyond.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 3","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143554667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel Sexalinear Decomposition Algorithm for Analyzing the Chemical Sexalinear Data Array","authors":"Yue-Yue Chang, Qiu-Na Shi, Tong Wang, Hai-Long Wu, Ru-Qin Yu","doi":"10.1002/cem.70013","DOIUrl":"https://doi.org/10.1002/cem.70013","url":null,"abstract":"<p>As analytical instruments become increasingly high-way and complex, obtaining ultra-high-way chemical data and developing methods to analyze it is important and meaningful work. In this paper, a novel six-way algorithm combination method (six-way ACM) is proposed. In addition, a real, chemically meaningful ultra-high-way sexalinear data array was constructed for the first time. This six-way data array is highly collinear, which makes it more demanding to decompose. To verify the feasibility of the proposed algorithm, it was used to analyze this real sexalinear six-way data array and a series of simulated six-way data arrays with different noise levels. The results for real and simulated data demonstrate that the proposed method is well suited to the analysis of six-way data arrays and performs well: it is insensitive to an excessive number of components, converges quickly, and is suitable for highly collinear and noisy data. Compared with three-way, four-way, and five-way calibration methods, the six-way ACM provides higher sensitivity, a lower limit of detection, a lower limit of quantification, and more stable and accurate results, showing an outstanding “higher-order advantage” and a better ability to handle collinearity problems. This work provides not only a data analysis method for high-order instruments that may emerge in the future but also real data support and a methodological reference for theoretical research on high-order tensor algebra.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 3","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143481608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Wavelength Selection for Limited Near-Infrared Spectral Data via Genetic Algorithm and Hybrid Regression","authors":"Esra Pamukçu","doi":"10.1002/cem.70015","DOIUrl":"https://doi.org/10.1002/cem.70015","url":null,"abstract":"<p>Spectral data often contains a large number of variables that are highly correlated. Although Partial Least Squares (PLS) regression is specifically designed to handle issues arising from limited sample sizes, its effectiveness may still diminish in extremely small datasets, making it challenging to construct a calibration model with high predictive performance. This study introduces a new framework, the Genetic Algorithm and Hybrid Regression Model (GAHRM), designed specifically for variable selection and regression in high-dimensional, low-sample-size spectral datasets. GAHRM integrates Hybrid Regression, which constructs regression models using a covariance structure that is first stabilized through Thomaz Stabilization and then regularized, with a Genetic Algorithm (GA), an efficient optimization technique for selecting the best subset of variables from a vast model space. Unlike traditional approaches that rely on exhaustive search for model selection criteria, GAHRM leverages GA to navigate the exponentially large search space, enabling computationally feasible and robust model construction. The effectiveness of GAHRM was validated on the benchmark “Gasoline” dataset, where it demonstrated superior performance compared to PLS in terms of prediction accuracy and model selection efficiency. These results highlight GAHRM as a powerful alternative for wavelength selection and calibration modeling in challenging data scenarios.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 3","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70015","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143481606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Principal Components Analysis: Row Scaling and Compositional Data","authors":"Richard G. Brereton","doi":"10.1002/cem.3606","DOIUrl":"https://doi.org/10.1002/cem.3606","url":null,"abstract":"<p>Row scaling is sometimes called normalisation, but this term is also sometimes used for column standardisation, so we will avoid the term normalisation in this article, to prevent confusion.</p><p>Of course, whether this improvement is observed does depend on the structure of the data, but if the difference between samples is primarily due to the relative concentrations or proportions and the amount of sample is not easy to control, row scaling to constant total often results in an improvement. It can be combined with other approaches for column transformation such as standardisation as discussed in the previous article.</p><p>If there are only two variables, the simplex is a line. In Figure 4, we illustrate the scores of the first 2 PCs of the dataset formed by the first two variables from Table 1. We see that after row scaling there is only one non-zero PC. In this case, the position along the line relates to the class membership of each object, although this is not always so and depends on an appropriate choice of variables.</p><p>In the case of the data in Table 1, row scaling improves visualisation of the class differences and structure in the data. However, row scaling is not always appropriate. If the absolute values of each variable are known accurately (e.g., the amount of sample extracted can be kept constant or calibrated to a known standard), compositional data lose information. In addition, sometimes there may be one or two very intense variables that are of subsidiary interest; for example, a primary metabolite that is very intense but has little or no relationship to the factors of interest; the proportions will be dominated by this uninteresting factor.</p><p>However, row scaling is a common procedure in many areas of chemometrics. There is a significant statistical literature about multivariate compositional data. If the main aim of an analysis is qualitative, for example, to separate groups or find outliers, often some of the more elaborate statistical considerations are of secondary importance. If, however, the data are to be used for statistical inference, such as hypothesis tests or <i>p</i> values or estimation, it is a good idea to look closely at the classical literature in order to best interpret and process compositional data.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 3","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.3606","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143455747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detection of Lead Chrome Green in Tea Based on Near-Infrared Reflectance Spectroscopy","authors":"Xiaogang Jiang, Penghui Cheng, Kang Ge, Siwei Lv, Yande Liu","doi":"10.1002/cem.70011","DOIUrl":"https://doi.org/10.1002/cem.70011","url":null,"abstract":"<p>Tea color is a part of tea quality, and illegal addition of lead chrome green (LCG) to improve tea quality cannot be identified by the human eye. This paper uses near-infrared (NIR) reflectance spectroscopy to detect LCG-stained tea and investigates the feasibility of qualitative and quantitative methods. Firstly, the LCG in tea was qualitatively analyzed by partial least squares discriminant analysis (PLS-DA), random forest (RF), and least squares support vector machine (LSSVM) classification models, and the results showed that the classification accuracy of LSSVM reached 100%. For quantitative analysis, Savitzky–Golay convolutional smoothing (S-G) preprocessing was combined with three feature extraction algorithms, namely, joint competitive adaptive weighted sampling (CARS), uninformative variable elimination (UVE), and successive projection algorithm (SPA), to build partial least squares (PLS), RF, and LSSVM regression models sequentially on the preprocessed data. The S-G-UVE-LSSVM showed the best regression prediction ability in detecting LCG in tea, with a tested <i>R</i><sup>2</sup> of 0.96. These results show the feasibility of NIR spectroscopy for the detection of added LCG in tea.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 3","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143439091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Determination of Halitosis by Exhaled Breath Analysis Using Semiconductor Metal Oxide Sensors and Chemometric Methods","authors":"Mikhail Saveliev, Andrey Volchek, Galina Lavrenova, Ol'ga Malay, Mikhail Grevtsev, Igor Jahatspanian","doi":"10.1002/cem.70012","DOIUrl":"https://doi.org/10.1002/cem.70012","url":null,"abstract":"<p>Halitosis is a condition associated with bad breath. Although halitosis is a disease in its own right, it is often a symptom of more serious diseases (diabetes mellitus, renal failure, azotemia, etc.). The currently used method for diagnosing halitosis is the organoleptic method, which relies on a trained specialist evaluating the patient's breath odor. This approach to diagnosing halitosis is subjective, uncomfortable for both patient and doctor, and necessitates the involvement of a specially trained professional. As an alternative, instrumental diagnostics employing metal oxide semiconductor (MOS) sensor arrays offer a promising avenue by enabling patient classification through predeveloped models. This paper considers the application of seven MOS sensors of different compositions at three different temperatures. Different methods of chemometric data analysis were applied: <i>k</i>-nearest neighbors (kNN), decision trees (DT), support vector machine (SVM), logistic regression (LR), and projection on latent structures discriminant analysis (PLSDA). All applied methods demonstrated their effectiveness and achieved selectivity, sensitivity, and accuracy values exceeding 85%. Additionally, a combined classifier leveraging responses from all previously studied classifiers was explored, achieving near-perfect classification accuracy.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 2","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143431714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multiple Linear Regression–Based Algorithm to Correct for Cosmic Rays in Raman Images","authors":"Hery Mitsutake, Eneida de Paula, Heloisa N. Bordallo, Douglas N. Rutledge","doi":"10.1002/cem.70000","DOIUrl":"https://doi.org/10.1002/cem.70000","url":null,"abstract":"<p>Raman imaging is a powerful technique for simultaneously obtaining chemical and spatial information on diverse materials. One of the most common detectors used on Raman equipment is the charge coupled detector (CCD) due to its high sensitivity. However, CCDs are also sensitive to cosmic rays, which generate very narrow and intense signals: cosmic ray spikes. Since these peaks can be very intense and numerous, it is important to eliminate them before any data analysis. Some methods to do this use comparison of neighboring pixels to identify spikes, but when using the line-scanning acquisition mode, it is common that these spikes appear in two or more pixels close together. Thus, in this work, a new algorithm has been developed to correct for cosmic ray spikes in Raman images, based on multiple linear regression (MLR). This algorithm takes less than 1 min on images with more than 70,000 spectra and removes all spikes, even those at low intensity.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 2","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70000","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143431713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal Stacked Modeling for Simultaneous Detection of Nutrient Concentrations With Turbidity Correction","authors":"Meryem Nini, Mohamed Nohair","doi":"10.1002/cem.70009","DOIUrl":"https://doi.org/10.1002/cem.70009","url":null,"abstract":"<p>This paper investigates an innovative method for the simultaneous determination of nitrite, nitrate, and COD in water when turbidity is a source of noise in the spectroscopic data. UV–Vis absorption spectrometry is combined with advanced machine learning to develop a stacking model, a modeling approach that combines several base models (PLS, Lasso, and Ridge regression) with a meta-regressor (Random Forest regressor) to improve prediction accuracy; baseline correction and principal component analysis (PCA) are incorporated to mitigate the effects of turbidity on the spectroscopic data. After applying these corrections, a significant improvement was observed: the root mean square error (RMSE) and the mean absolute error (MAE) were significantly reduced, and the correlation coefficient (<i>R</i><sup>2</sup>) between predicted and actual values of nitrite, nitrate, COD, and turbidity was greater than 0.96 for all compounds in the test data set. This demonstrates the ability of the proposed stacking model to accurately predict nutrient concentrations simultaneously, even in complex environments, and the proposed model may provide a valuable alternative to wet chemical methods. Due to its high accuracy and fast response, the proposed model can be used as an algorithm for the construction of nutrient sensors. This paper highlights the importance of integrating advanced modeling and data correction techniques to improve the robustness and accuracy of predictive models in environmental chemistry, thus providing valuable information for environmental monitoring and management.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 3","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143431375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Progress of Complex System Process Analysis Based on Modern Spectroscopy Combined With Chemometrics","authors":"Maogang Li, Qi Cai, Tianlong Zhang, Hongsheng Tang, Hua Li","doi":"10.1002/cem.70006","DOIUrl":"https://doi.org/10.1002/cem.70006","url":null,"abstract":"<p>In recent years, the role of analytical chemistry has undergone a gradual transformation, evolving from a mere participant to a pivotal decision-maker in process optimisation. This shift can be attributed to the advent of sophisticated analytical instrumentation, which has ushered in a new era of analytical capabilities. This article presents a review of the developments in the application of intelligent analysis techniques, including infrared (IR) spectroscopy, Raman spectroscopy, and laser-induced breakdown spectroscopy (LIBS), in the processing of complex systems over the past decade. The review provides an introduction to the fundamental principles of these analytical techniques and examines the evolution of their instrumentation to accommodate online process monitoring. The analysis of spectral data in complex system processes represents a fundamental aspect of the attainment of on-site quality monitoring, process optimisation and control. Accordingly, the review provides a comprehensive overview of the methodologies employed in process chemometrics, encompassing spectral preprocessing, feature selection, modelling techniques, and optimisation strategies for model performance. Furthermore, this article presents a summary of three intelligent spectral analysis tools, namely infrared spectroscopy, Raman spectroscopy, and LIBS, which are widely employed in process simulation, monitoring, optimisation, and control across multiple disciplines, including the environment, energy, biology, and food. The objective of this review is to provide a valuable reference point and guidance for the further promotion and utilisation of spectral intelligent analysis instruments, with the aim of promoting their in-depth application and development in a greater number of fields.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 2","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143380521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparison of Chemometric Explorative Multi-Omics Data Analysis Methods Applied to a Mechanistic Pan-Cancer Cell Model","authors":"J. A. Westerhuis, A. Heintz-Buschart, H. C. J. Hoefsloot, F. M. van der Kloet, G. R. van der Ploeg, F. T. G. White","doi":"10.1002/cem.70001","DOIUrl":"https://doi.org/10.1002/cem.70001","url":null,"abstract":"<p>The analysis of single-cell multi-omics data is a complex task, and many explorative data analysis methods are being used to draw information from such data. This paper compares several of these methods to visualize the output of a mechanistic model under various simulated conditions. The analysis methods include PCA, PARAFAC, ASCA, MASCARA, COVSCA, P-ESCA, and PE-ASCA. These techniques, applied to high-dimensional data such as gene expression and protein levels, assess correlations across time series and experimental conditions. The study uses a complex mechanistic model of MCF10A cancer cells, simulating interactions between signaling pathways related to cell growth and division. Results show that while methods like PCA, PARAFAC, and ASCA reveal time-dependent variations in protein data, mRNA data exhibit minimal systematic variation. MASCARA offers unique insights by identifying genes linked to specific pathways. This work highlights the potential and limitations of various data analysis methods in understanding multi-omics data, particularly in single-cell contexts where experimental variation and stochastic processes complicate interpretation.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 2","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70001","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143389009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}