{"title":"Self-Optimizing Radial Basis Function Support Vector Classifier (SO-RBFSVC)","authors":"Qudus Ayodeji Thanni, Peter de Boves Harrington","doi":"10.1002/cem.70038","DOIUrl":"https://doi.org/10.1002/cem.70038","url":null,"abstract":"<p>Support vector classifiers (SVCs) typically use radial basis function (RBF) kernels to map data into higher dimensional spaces that may improve the linear separation of otherwise nonseparable classes. We present a novel self-optimizing radial basis function support vector classifier (SO-RBFSVC) that integrates response surface methodology (RSM), two-dimensional cubic spline interpolation, and bootstrapped Latin partitions (BLPs) for automated hyperparameter tuning. The SO-RBFSVC simultaneously optimizes the RBF kernel width (<i>σ</i>) and cost parameter (<i>C</i>) using an interpolated response surface obtained from generalized prediction accuracies. The SO-RBFSVC was compared to other self-optimizing classifiers (super SVC [sSVC] and super partial least squares discriminant analysis [sPLS-DA]). Four datasets were evaluated: (i) hemp and marijuana discrimination using proton nuclear magnetic resonance spectra, (ii) barley growth location using near-infrared spectra, (iii) glass-type identification based on elemental composition, and (iv) wine cultivar classification from physicochemical properties. External validation results showed that SO-RBFSVC performed comparably to the other models, achieving error rates of 0.4 ± 0.5% for hemp/marijuana, 7 ± 1% for glass, and 6 ± 1% for wine, while outperforming the linear models with 10 ± 1% error for the barley NIR data. For the first time, generalized sensitivity analysis (GSA) was applied to quantify model linearity. GSA revealed high nonlinearity in the barley dataset, justifying a nonlinear model. 
The SO-RBFSVC provides robust, automated classifier tuning for low- and high-dimensional datasets, offering ease of use.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 6","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-05-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70038","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144140398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How Are Chemometric Models Validated? A Systematic Review of Linear Regression Models for NIRS Data in Food Analysis","authors":"Jokin Ezenarro, Daniel Schorn-García","doi":"10.1002/cem.70036","DOIUrl":"https://doi.org/10.1002/cem.70036","url":null,"abstract":"<p>Chemometric models play a critical role in the spectroscopic analysis of food, particularly with near-infrared spectroscopy (NIRS), enabling the accurate prediction and monitoring of physicochemical properties. Although chemometric methods have proven to be useful tools in NIRS analysis, their reliability depends on rigorous validation to ensure the soundness of their predictions and their applicability. This systematic review examines validation strategies applied to regression models in NIRS-based food analysis, emphasising the use of cross-validation, external validation and figures of merit (FoM) as key evaluation tools. A comprehensive literature search identified trends in validation methodologies, highlighting frequent reliance on partial least squares (PLS) regression and common flaws in validation methodologies and their reporting. While external validation is considered the best approach, many studies lack it and rely solely on cross-validation, which may lead to overoptimistic model performance estimates. Furthermore, inconsistencies in the selection and definition of FoM hinder direct comparison across studies. 
This review underscores the need for increased methodological transparency and rigour in the validation of chemometric models to enhance their reliability.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 6","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70036","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144085154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"De Novo Design of HIV-1 Integrase-LEDGF/p75 Inhibitors Through Deep Reinforcement Learning and Virtual Screening","authors":"Hai-Bo Sun, Hai-Long Wu, Tong Wang, An-Qi Chen, Ru-Qin Yu","doi":"10.1002/cem.70037","DOIUrl":"https://doi.org/10.1002/cem.70037","url":null,"abstract":"<div><p>Human immunodeficiency virus (HIV) has far-reaching impacts on global public health. Acquired immunodeficiency syndrome (AIDS) has caused millions of deaths globally, with thousands still getting infected. Therefore, developing HIV-1 integrase inhibitors is crucial for controlling AIDS by slowing virus replication and transmission. This study is grounded in the framework of deep reinforcement learning, aiming to design de novo inhibitors of the HIV-1 integrase-Lens Epithelial-Derived Growth Factor/p75 interaction and subsequently employing molecular docking to screen potential therapeutic compounds. Initially, a molecular generation model was established based on the long short-term memory algorithm and refined through transfer learning to obtain a preliminary generative model. Subsequently, the deep reinforcement learning strategy was employed, using inhibition activity as a reward value, making the model more likely to generate molecules with desirable properties. The results indicate that the reinforced generation model not only generates novel and effective SMILES structures with medicinal potential but also demonstrates strong binding affinity between the generated molecules and the target protein, as indicated by molecular docking experiments. 
Ultimately, through virtual screening, we identified six lead compounds having the potential to become inhibitors of the interaction between Lens Epithelial-Derived Growth Factor/p75 and HIV-1 integrase, providing an effective and practical strategy for de novo drug design of HIV-1 integrase inhibitors.</p></div>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 5","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143939411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Two-Parameter Estimation Technique for Handling Multicollinearity in Inverse Gaussian Regression Model","authors":"Ishrat Riaz, Aamir Sanaullah, Mustafa M. Hasaballah, Oluwafemi Samson Balogun, Mahmoud E. Bakr","doi":"10.1002/cem.70032","DOIUrl":"https://doi.org/10.1002/cem.70032","url":null,"abstract":"<div><p>This study focuses on the prevalent issue of multicollinearity in the inverse Gaussian regression model (IGRM), which arises when predictor variables have a high degree of correlation. The typical maximum likelihood estimator (MLE) proves to be highly unstable when dealing with linearly related regressors. As a result, the accuracy of the model may suffer because of inflated variances and inaccurate coefficient estimates. To improve parameter estimation accuracy and combat multicollinearity, this paper suggests an alternative biased estimator for the IGRM that integrates a two-parameter framework. This novel two-parameter estimator is a general estimator that takes the maximum likelihood, ridge, and Stein estimators as special cases. The theoretical characteristics of the estimator, including its bias and mean squared error (MSE), are developed and then compared thoroughly with those of the previous estimators in terms of the mean squared error matrix (MMSE) criterion. Moreover, the optimal values of the biasing parameters for the proposed estimator are also obtained. An extensive simulation study and a real-world dataset are examined to assess the practical relevance of the proposed estimator. The empirical results show that, in comparison to conventional estimators, including MLE, ridge, and Stein estimators, the suggested estimator considerably lowers the MSE and improves the parameter estimation accuracy. These results illustrate the novel approach's potential for dealing with multicollinearity in IGRM. 
The continuous development of reliable estimating methods for generalized linear models (GLMs) is aided by these findings.</p></div>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 5","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-05-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143925881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Feasibility Study on Identifying Seed Variety of Soybean With Hyperspectral Imaging and Deep Learning","authors":"Lei Pang, Zhen Wang, Siyan Mi, Hui Li","doi":"10.1002/cem.70035","DOIUrl":"https://doi.org/10.1002/cem.70035","url":null,"abstract":"<div><p>Seed variety purity is an important indicator of seed quality, and mixing soybean seeds at different maturity stages can affect crop growth and food quality. This study investigated the feasibility of recognizing five soybean varieties at different maturity stages using hyperspectral imaging. Hyperspectral data from 3600 soybean seeds were collected in the range of 395.5–1003.7 nm. First, the potential to qualitatively distinguish the five soybean varieties was assessed using visual cluster analyses based on principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Next, the performance of four classification models was compared: random forest (RF), extreme learning machine (ELM), partial least squares discriminant analysis (PLS-DA), and one-dimensional convolutional neural network (1DCNN). Multiplicative scatter correction (MSC) preprocessing significantly improved the recognition results of all four models, with the 1DCNN model demonstrating the highest accuracy and most stable recognition performance. The effects of feature bands extracted using competitive adaptive reweighted sampling (CARS), variable importance in projection (VIP), and local linear embedding (LLE) on the four models were also compared. The accuracy of all four feature band sets, when combined with the MSC+1DCNN model, exceeded 96% in identifying soybean varieties. 
Therefore, these results indicate that the 1DCNN discriminant analysis model is suitable for spectral data analysis in soybean seed variety classification and can significantly enhance classification accuracy.</p></div>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 5","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143892886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiview Ensemble Learning Framework for Real-Time UV Spectroscopic Detection of Nitrate in Water With Chemometric Modelling","authors":"Sagar Rana, Sudeshna Bagchi","doi":"10.1002/cem.70033","DOIUrl":"https://doi.org/10.1002/cem.70033","url":null,"abstract":"<div><p>Accurate detection of nitrate in water for quality monitoring is a significant yet challenging task. To address this, the present work proposes an ensemble machine learning–based chemometric framework for the optical detection of nitrate in water. It incorporates an absorbance-based reagent-less detection of nitrate in water to support the robustness of the model. The absorption spectra were recorded using a portable set-up in the presence and absence of interfering ions. Different interfering ions, namely, nitrite (NO<sub>2</sub><sup>−</sup>), calcium (Ca<sup>2+</sup>), magnesium (Mg<sup>2+</sup>), carbonate (CO<sub>3</sub><sup>2−</sup>), bromide (Br<sup>−</sup>), chloride (Cl<sup>−</sup>) and phosphate (PO<sub>4</sub><sup>3−</sup>), in all possible combinations (binary, ternary, quaternary, quinary, senary and septenary mixtures) are added to the target analyte to validate the real-time application of the proposed algorithm. Under the multiview framework, two models, MVNPM-I and MVNPM-II, i.e., multiview nitrate prediction models, are proposed. MVNPM-I is based on an ensemble of regressors' results, and MVNPM-II uses multiple views of the dataset followed by an ensemble of their results. The performance of the models is assessed using a hold-out validation scheme with 10 repetitions and measured using the <i>R</i><sup>2</sup> score and mean squared error (MSE). The best results, an <i>R</i><sup>2</sup> score of 0.9978 with a standard deviation of 0.0014 and an MSE of 1.1799 with a standard deviation of 0.8639, are obtained using the MVNPM-II model. Further, the performance measures of the proposed models show that they can handle the presence of interfering ions. 
The algorithm was also tested using real-world samples with an <i>R</i><sup>2</sup> score and MSE of 0.9998 and 0.696, respectively. The promising results strengthen the applicability of the proposed method in real-world scenarios.</p></div>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 5","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143892885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantitative Structure–Activity Relationship Modeling Based on Improving Kernel Ridge Regression","authors":"Shaimaa Waleed Mahmood, Ghalya Tawfeeq Basheer, Zakariya Yahya Algamal","doi":"10.1002/cem.70027","DOIUrl":"https://doi.org/10.1002/cem.70027","url":null,"abstract":"<div><p>The quantitative structure–activity relationship (QSAR), an effective and promising model for better understanding the relationship between chemical activity and chemical compounds, is commonly used in modeling chemical datasets. Kernel ridge regression (KRR) has recently attracted the interest of scholars because of its non-iterative methodology for problem solving. KRR is a highly regarded and practical machine learning approach that has successfully tackled classification and regression problems. It is a regression method that uses a nonlinear kernel function to define an inner product in a higher-dimensional transformed space, which allows for generalization performance based on a regularized least squares solution. However, the performance of KRR is affected by the choice of the values of the hyperparameters that define the type of kernel. Prior methods of determining these hyperparameter values have a major processing cost, use memory, and are accompanied by poor accuracy. Thus, the main contribution of this paper is the enhancement of the coati optimization algorithm by applying elite opposite-based learning to increase the population density around the optima of the search space for the proper selection of the best hyperparameters. To verify the proposed improvement of KRR and compare its performance, seven public chemical datasets were used. 
Based on several assessment criteria, the results show that the proposed improvement is superior to all the baseline methods regarding the classification performance.</p></div>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 5","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143892887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction to “Fast Partition-Based Cross-Validation With Centering and Scaling for XTX and XTY”","authors":"","doi":"10.1002/cem.70034","DOIUrl":"https://doi.org/10.1002/cem.70034","url":null,"abstract":"<p>\u0000 <span>Galbo Engstrøm, O.-C.</span> and <span>Holm Jensen, M.</span> (<span>2025</span>), <span>Fast Partition-Based Cross-Validation With Centering and Scaling for <b>X</b><sup><b>T</b></sup><b>X</b> and <b>X</b><sup><b>T</b></sup><b>Y</b></span>. <i>Journal of Chemometrics</i>, <span>39</span>: e70008, https://doi.org/10.1002/cem.70008.\u0000 </p><p>On line 27 in Algorithm 7 on page 10, the text to the right reads “Obtain <b>X</b><sup><b>csT</b></sup><b>Y</b><sup><b>csT</b></sup>” but should read “Obtain <b>X</b><sup><b>csT</b></sup><b>Y</b><sup><b>cs</b></sup>”.</p><p>In Proposition 15 on page 11, the last equality contains a double hat over <b>x</b><sub><b>s</b></sub><sup><b>T</b></sup>. It should have been a single hat.</p><p>We apologize for the confusion.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 5","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70034","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HiBBKA: A Hybrid Method With Resampling and Heuristic Feature Selection for Class-Imbalanced Data in Chemometrics","authors":"Ying Guo, Ying Kou, Lun-Zhao Yi, Guang-Hui Fu","doi":"10.1002/cem.70029","DOIUrl":"https://doi.org/10.1002/cem.70029","url":null,"abstract":"<div><p>In critical domains including medicinal chemistry, biomedicine, metabolomics, and computational toxicology, class imbalance in datasets and poor recognition accuracy for minority classes remain persistent challenges. While previous studies have employed resampling and feature selection techniques to address data imbalance and enhance classification performance, most approaches have focused on single-algorithm solutions rather than hybrid methodologies. Hybrid algorithms offer distinct advantages by integrating the strengths of multiple techniques, thereby providing more comprehensive and efficient solutions for handling imbalanced data. This study proposes HiBBKA, a novel hybrid algorithm combining radial-based under-sampling with SMOTE (RBU-SMOTE) and an improved binary black-winged kite algorithm (iBBKA) for feature selection. The proposed framework operates through two key phases: First, the RBU-SMOTE resampling method synergistically integrates radial-based under-sampling (RBU) with the synthetic minority oversampling technique (SMOTE), effectively addressing class-imbalance distribution while enhancing the quality of synthesized samples. Second, the enhanced iBBKA feature selection algorithm systematically identifies the most discriminative features critical for classification tasks. We comprehensively evaluate RBU-SMOTE and HiBBKA using multiple classifiers across 16 imbalanced datasets, including real-world medical datasets, with particular emphasis on the minority class performance. 
Experimental results demonstrate that RBU-SMOTE achieves competitive performance compared to existing resampling methods, while the complete HiBBKA framework significantly outperforms state-of-the-art algorithms in overall classification metrics, particularly in the minority class recognition.</p></div>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 5","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143852899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Geographical Influence on Metabolite Profiles of Cupressus torulosa: UPLC-QTOF-MS (Positive Mode) and Chemometric Insights","authors":"Radhika Khanna, Khushaboo Bhadoriya, Gaurav Pandey, V. K. Varshney","doi":"10.1002/cem.70031","DOIUrl":"https://doi.org/10.1002/cem.70031","url":null,"abstract":"<div><p><i>C. torulosa</i>, known as the Himalayan or Bhutan cypress, is a significant evergreen conifer that typically reaches heights between 20 and 45 m. This species is primarily found in the Himalayan regions of Bhutan, northern India, Nepal, and Tibet. In this study, we utilized ultra-performance liquid chromatography coupled with quadrupole time-of-flight mass spectrometry (UPLC-QTOF-MS) in positive ion mode, along with chemometric analysis, to investigate the metabolomic profiles of <i>C. torulosa</i> needles collected from 14 geographically distinct areas in Uttarakhand and Himachal Pradesh. Various statistical techniques, including ANOVA, Principal Component Analysis (PCA), Hierarchical Cluster Analysis (HCA), violin plots, scatter plots, box-and-whisker plots, and heatmaps, were employed to illustrate the relative quantitative differences among compounds based on their peak intensities across these regions. Our investigation revealed 34 marker compounds consistently detected across all samples (locations). These compounds were screened using rigorous filtering criteria, incorporating a moderated <i>t</i>-test and multiple testing adjustments using the Benjamini–Hochberg false discovery rate (FDR) approach. Furthermore, we pioneered the identification of the phenylpropanoid and flavonoid biosynthesis pathways in <i>C. torulosa</i>, providing new insights into its metabolic profile. This work establishes a foundational reference for future research into the species' metabolome, helping guide studies in areas like genetic diversity, ecological adaptations, and climate resilience in <i>C. torulosa</i>. 
Mapping these pathways deepens scientific knowledge of <i>C. torulosa</i>'s metabolic processes, contributing to a clearer understanding of its unique biochemical makeup.</p></div>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 4","pages":""},"PeriodicalIF":2.3,"publicationDate":"2025-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143831301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}