{"title":"Marker genes identification and prediction of Parkinson's disease by integrating blood-based multi-omics data","authors":"Jisha Augustine, A.S. Jereesh","doi":"10.1016/j.chemolab.2025.105478","DOIUrl":"10.1016/j.chemolab.2025.105478","url":null,"abstract":"<div><div>Parkinson's disease (PD) is a rapidly progressing neurodegenerative disease marked by a combination of motor and non-motor symptoms. The molecular mechanism of PD remains unexplained, and there is currently no genetic risk factor with clinically proven reliability. Therefore, diagnosing PD has relied chiefly on analyzing brain images and clinical tests. Understanding the molecular-level mechanism of PD is challenging, primarily due to the complexities involved in sampling the posterior brains of both typical individuals and those with PD; however, several independent studies have recently produced and assessed extensive omics data obtained from blood samples, making the diagnosis cheap and less invasive. Therefore, developing diagnostic or predictive methods for PD utilizing these data is necessary. In addition, integrating omics data can serve as a valuable asset for a comprehensive understanding of the disease. This research devised a computational approach to predict PD by integrating gene expression and DNA methylation datasets. The significant challenges were the high dimensionality and heterogeneous data sources. A two-level statistical approach is proposed to identify Differentially expressed and Methylated Genes. The Archimedes Optimization Algorithm, a meta-heuristic algorithm, selects 17 optimal genes and 18 mapping CpG sites. A clustering-based method is proposed to integrate the heterogeneous omics data. Predictions of PD and healthy samples are performed using the TabNet classification model. The proposed approach demonstrated an ROC-AUC of 0.7615 and an F1-score of 0.7325 on test data. 
The significance of our work is supported by biological analysis and assessment metrics.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"265 ","pages":"Article 105478"},"PeriodicalIF":3.7,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144580343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Uncertainty of predictions in absorption spectroscopy: Modelling with quantile regression forest","authors":"Alexandre M.J.-C. Wadoux , Leonardo Ramirez-Lopez","doi":"10.1016/j.chemolab.2025.105473","DOIUrl":"10.1016/j.chemolab.2025.105473","url":null,"abstract":"<div><div>Machine learning modelling is becoming popular for estimating agricultural and environmental properties from their infrared spectra. Commonly in modelling with machine learning and in commercial software applications, however, uncertainty estimates of the prediction are seldom reported. Uncertainty quantification of variables predicted with infrared spectroscopy is nevertheless highly relevant in a number of applications, such as in uncertainty propagation analysis studies or for drug exposure detection. In this paper, we report on the development and application of quantile regression forest to predict properties from infrared spectroscopic data along with a sample-specific estimate of the uncertainty. Quantile regression forest is a machine learning algorithm that builds on random forest and provides an estimate not only of the mean but also of the full conditional distribution of the predicted variable. We illustrate the algorithm with two chemometric applications and evaluate the modelling approach for its ability to predict the variable of interest and quantify the uncertainty. Evaluation involved usual validation statistics but also the validation of the uncertainty with the prediction interval coverage probability calculated for various interval widths. We tested prediction and prediction uncertainty quantification of two soil properties (cation exchange capacity and total organic carbon) as well as the dry matter of mango. The results confirm the potential of quantile regression forests for prediction and uncertainty quantification of properties predicted from infrared spectroscopy data. In all cases, the predictions were accurate and sample-specific estimates of the uncertainty were obtained. 
Validation of the uncertainty showed that the interval width was too large, thus overestimating the uncertainty for most intervals. Nevertheless, we recommend its use for operational applications as well as in future software developments, in particular when the data inferred by the spectroscopic model are used in other applications.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"265 ","pages":"Article 105473"},"PeriodicalIF":3.7,"publicationDate":"2025-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144569895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"One criterion, two merits: A single-criterion-based sample selection method for informativeness and diversity","authors":"Zhongjiang He , Zhonghai He , Xiaofang Zhang","doi":"10.1016/j.chemolab.2025.105477","DOIUrl":"10.1016/j.chemolab.2025.105477","url":null,"abstract":"<div><div>In streaming batch-mode active learning, sample selection typically involves two stages: informativeness measurement and similarity measurement. By analyzing the expression of model performance improvement induced by new samples, we identify a linear relationship between the performance gradient and the sample's vectors. Based on this finding, we propose a streaming batch active learning sample selection method, named One Criterion Two Merits (OCTM), which integrates informativeness and diversity measurement using a single criterion—the model improvement gradient. First, the model update gradient is computed for each incoming sample. Then, the magnitude of this gradient is used as an informativeness measure. Finally, the minimum angle between the new sample and buffer samples is calculated to quantify diversity. The threshold used for real-time decisions is critical in data stream scenarios, and its estimation traditionally relies on the assumption of a known threshold distribution. To address this issue, we propose a distribution-free threshold estimation method that determines the threshold based on the distribution of labeled samples. 
By sorting the measurement values and setting a confidence level, the threshold can be effectively computed.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"264 ","pages":"Article 105477"},"PeriodicalIF":3.7,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144549547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Graph residual based method for molecular property prediction","authors":"Kanad Sen , Saksham Gupta , Abhishek Raj , Alankar Alankar","doi":"10.1016/j.chemolab.2025.105471","DOIUrl":"10.1016/j.chemolab.2025.105471","url":null,"abstract":"<div><div>Machine learning-driven methods for chemical property prediction have been of deep interest. However, much work remains to be done to improve the generalization ability, accuracy, and inference time of critical applications. Traditional machine learning models predict properties based on the features extracted from the molecules, which are often not readily available. In this work, a novel deep learning method, the Edge Conditioned Residual Graph Neural Network (ECRGNN), has been applied, allowing us to predict properties directly from only the graph-based structures of the molecules. The SMILES (Simplified Molecular Input Line Entry System) representation of the molecules has been used in the present study as the input data format, which has been further converted into a graph database, constituting the training data. This article highlights a detailed description of the novel GRU (Gated Recurrent Unit) - based methodology, ECRGNN, to map the inputs that have been used. Emphasis is placed on highlighting both the regressive property and the classification efficacy of the same. A detailed description of the Variational Autoencoder (VAE) and the end-to-end learning method used for multi-class multi-label property prediction has also been provided. The results have been compared with standard benchmark datasets and some newly developed datasets. 
All performance metrics used have been clearly defined, along with the reasons for choosing them.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"265 ","pages":"Article 105471"},"PeriodicalIF":3.7,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144633096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Open Raman spectral library for biomolecule identification","authors":"Marcelo Terán , José Javier Ruiz , Pablo Loza-Alvarez , David Masip , David Merino","doi":"10.1016/j.chemolab.2025.105476","DOIUrl":"10.1016/j.chemolab.2025.105476","url":null,"abstract":"<div><div>Raman spectroscopy combined with Multivariate Curve Resolution (MCR) analysis is widely used in biomedical applications. However, assignation of biomolecules to the components extracted by MCR can be challenging due to the absence of an open Raman spectral library for biomolecules. Raman experts typically identify unmixed component spectra as biomolecules by comparing them with reference spectra from the literature. This process can be time-consuming and subject to human bias. In this work, we created an open Raman spectral database with 140 biomolecules by implementing an algorithm to digitalize the spectra plots and most relevant peaks from articles available in the literature. Additionally, we implemented two search algorithms. The first one uses the spectral linear kernel or cosine similarity on the full spectra. The second algorithm is based on peak matching, and relies on the intersection over the union of the matched peaks with a defined tolerance for peak matching. Our experimental validation showed 100 % top 10 accuracy in molecule identification (e.g. collagen) and 100 % accuracy in molecule type identification (e.g. protein) in both pure biomolecule measurements and also when replicating results from prior studies. Objectively narrowing the identification to the top 10 ranked candidates and providing type identification can significantly reduce both the time required for visual identification and the need to purchase reference component samples. We publish our spectral library as an open-source tool so it can be expanded collaboratively by the research community. 
It is available at: <span><span>https://github.com/mteranm/ramanbiolib</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"264 ","pages":"Article 105476"},"PeriodicalIF":3.7,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144557410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measurement uncertainty and risk of false decisions for similarity factor (f2): Boostrapping method or spreadsheet?","authors":"Jheniffer Rabelo, Felipe R. Lourenço","doi":"10.1016/j.chemolab.2025.105475","DOIUrl":"10.1016/j.chemolab.2025.105475","url":null,"abstract":"<div><div>The similarity factor (<em>f</em><sub><em>2</em></sub>) is a key metric to compare generic and reference drugs and to obtain biowaivers of bioequivalence studies of Active Product Ingredient (API) in dissolution testing. Yet, its statistical limitations - including low power and undefined confidence levels - restrict its reliability. Therefore, Bootstrapping is a widely used approach for establishing a Confidence Interval for <em>f</em><sub><em>2</em></sub>. Nevertheless, this approach is not user-friendly for non-statisticians, and obtaining the uncertainty related to the risk assessment of the <em>f</em><sub><em>2</em></sub> in dissolution testing requires additional steps. In this investigation, we propose the Kragten spreadsheet method as a practical alternative to the Bootstrapping approach in the evaluation of the consumer's risk. Comparative analysis was performed under six scenarios of API's, involving reference/test drugs and registered/new formulations. All six groups met regulatory criteria (<em>f</em><sub><em>2</em></sub>>50 %), while 95 % confidence intervals from both statistical methods showed agreement, confirming methodological reliability. Although all groups achieved <em>f</em><sub><em>2</em></sub>>50 %, the range obtained through the Bootstrapping and Kragten methods determined that three scenarios (A, C, and E) presented elevated consumer risk (>5 %), highlighting limitations of <em>f</em><sub><em>2</em></sub> alone. Also, the differences between both methodologies were measured. 
For Bootstrapping, 50 iterations per 10.000 simulations showed no statistically significant differences from Kragten (<em>p</em> > 0.05), establishing method equivalence for symmetrical dissolution data. The findings advocate combining <em>f</em><sub><em>2</em></sub> with uncertainty analysis for risk assessment in dissolution testing.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"264 ","pages":"Article 105475"},"PeriodicalIF":3.7,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144516932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"(ϵ,δ)-differentially private partial least squares regression","authors":"Ramin Nikzad-Langerodi , Mohit Kumar , Du Nguyen Duy , Mahtab Alghasi","doi":"10.1016/j.chemolab.2025.105465","DOIUrl":"10.1016/j.chemolab.2025.105465","url":null,"abstract":"<div><div>As data-privacy requirements are becoming increasingly stringent and statistical models based on sensitive data are being deployed and used more routinely, protecting data-privacy becomes pivotal. Partial Least Squares (PLS) regression is the premier tool for building such models in analytical chemistry, yet it does not inherently provide privacy guarantees, leaving sensitive (training) data vulnerable to privacy attacks. To address this gap, we propose an <span><math><mrow><mo>(</mo><mi>ϵ</mi><mo>,</mo><mi>δ</mi><mo>)</mo></mrow></math></span>-differentially private PLS (edPLS) algorithm, which integrates well-studied and theoretically motivated Gaussian noise-adding mechanisms into the PLS algorithm to ensure the privacy of the data underlying the model. Our approach involves adding carefully calibrated Gaussian noise to the outputs of four key functions in the PLS algorithm: the weights, scores, <span><math><mi>X</mi></math></span>-loadings, and <span><math><mi>Y</mi></math></span>-loadings. The noise variance is determined based on the sensitivity of each function, ensuring that the privacy loss is controlled according to the <span><math><mrow><mo>(</mo><mi>ϵ</mi><mo>,</mo><mi>δ</mi><mo>)</mo></mrow></math></span>-differential privacy framework. Specifically, we derive the sensitivity bounds for each function and use these bounds to calibrate the noise added to the model components. Experimental results demonstrate that edPLS effectively renders privacy attacks, aimed at recovering unique sources of variability in the training data, ineffective. 
Application of edPLS to the NIR corn benchmark dataset shows that the root mean squared error of prediction (RMSEP) remains competitive even at strong privacy levels (i.e., <span><math><mrow><mi>ϵ</mi><mo>=</mo><mn>1</mn></mrow></math></span>), given proper pre-processing of the corresponding spectra. These findings highlight the practical utility of edPLS in creating privacy-preserving multivariate calibrations and for the analysis of their privacy-utility trade-offs.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"264 ","pages":"Article 105465"},"PeriodicalIF":3.7,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144502427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new segmentation-assisted interpolation method for creating maps in the study of artworks","authors":"Domingo Martín , Germán Arroyo , Juan Carlos Torres , Luis López , María Rosario Blanc , Juan Ruiz de Miras","doi":"10.1016/j.chemolab.2025.105466","DOIUrl":"10.1016/j.chemolab.2025.105466","url":null,"abstract":"<div><div>The generation of spatial distribution maps of chemical elements and compounds has become a crucial technique in materials research, particularly in the analysis of artworks. However, data acquisition in this context is often limited by the low number of measured points relative to the visual complexity of the artwork. As a result, interpolation methods are employed to infer unmeasured data. The most widely used method, Minimum Hypercube Distance (MHD), although statistically validated, exhibits significant limitations, as demonstrated in this study. We identified errors of up to 100% in some cases, exposing the method’s vulnerability in regions lacking sufficient data. To address these challenges, we propose a novel segmentation-assisted interpolation method. By integrating semantic segmentation, this approach improves the accuracy and interpretability of the resulting maps, allowing for the precise identification of unmeasured areas and the expert-guided replication of data from similar regions. 
This new methodology enhances the robustness of artwork analysis, providing more reliable tools for the study and preservation of artworks and ancient monuments.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"264 ","pages":"Article 105466"},"PeriodicalIF":3.7,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144491133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distortion prediction model considering process types in film manufacturing process and identification of critical process variables","authors":"Yuta Wakutsu , Satoshi Natori , Hiroki Ochiai , Kazuya Suda , Hiromasa Kaneko","doi":"10.1016/j.chemolab.2025.105474","DOIUrl":"10.1016/j.chemolab.2025.105474","url":null,"abstract":"<div><div>Optical films are used in flat panel displays, touch sensors and other devices. The film is wound and sent to the next process, but defects are generated in the winding, indicating an issue. Defects are often identified by customers after shipment. It is therefore anticipated that the identification of characteristics associated with defects at the end of the process will facilitate the detection of defective products in advance, or alternatively, result in a reduction of the defects themselves. One of the factors that contribute to the occurrence of defects is the distortion of the film on the roll surface. The degree of distortion is determined by calculating the difference between the instantaneous and average distance between the film surface and the sensor, as measured by the displacement meter installed before the winding. The objective of this study was to identify the process conditions that cause the distortion of the film as measured by the displacement meter through the application of machine learning techniques. A model was constructed between the sensor data of the process conditions and the distortion index, the relationship between them was identified, and the process causing the distortion was estimated by analyzing the model. 
The results of this study successfully narrowed down the process variables that are common causes of distortion among three displacement meters with different measurement positions.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"264 ","pages":"Article 105474"},"PeriodicalIF":3.7,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144491880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A likelihood ratio model for three-way data coupled with a Tucker3 model","authors":"Agnieszka Martyna , Eugenio Alladio , Monica Romagnoli , Fabrizio Malaspina , Marco Pazzi","doi":"10.1016/j.chemolab.2025.105464","DOIUrl":"10.1016/j.chemolab.2025.105464","url":null,"abstract":"<div><div>In forensic science, the analysis of diesel fuel is particularly important in fire investigations. The task is usually to compare the original accelerant found among the suspect’s belongings with the fire debris to find out if the fire could have been caused by the use of that particular diesel fuel (called source). The major problem when comparing the original accelerant with the fire debris is the weathering process of the accelerant taking place during the fire. The weathering process changes the composition of the accelerant with the weathering state, so that it may differ from the composition of the original accelerant. In this context the question arises if samples of the fire debris containing the accelerant weathered to different degrees are still so similar to the original accelerant that they can be regarded as coming from the same source (this particular accelerant) and whether samples of fire debris with accelerants from different sources are easily identified as such regardless of their weathering state. A hybrid likelihood ratio (LR) model, which takes into account the information about the similarity and the frequency of observing the compared features in the samples, was used to address these questions. Hybrid LR models use a new, limited set of variables generated using a variety of chemometric tools to summarise the data as well as possible and highlight the features that make each source of samples uniquely defined. The model was built for three-way GC–MS data of diesel fuel samples. A Tucker3 model decomposed the three-dimensional array of the database into three matrices referring to GC, MS and samples (concentration) modes. 
The scores on the linear discriminant functions for the concentration mode served as an input for LR models. True origins for the majority of samples were indicated despite different weathering.</div></div>","PeriodicalId":9774,"journal":{"name":"Chemometrics and Intelligent Laboratory Systems","volume":"264 ","pages":"Article 105464"},"PeriodicalIF":3.7,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144514020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}