XYOnion: a layer-based method for splitting datasets into calibration and validation subsets
Jokin Ezenarro
Analytica Chimica Acta, volume 1364, Article 344229 (published 2025-05-21)
DOI: 10.1016/j.aca.2025.344229
https://www.sciencedirect.com/science/article/pii/S0003267025006233
Journal impact factor: 5.7; JCR Q1 (Chemistry, Analytical)
Citations: 0
Abstract
Background
Chemometric models are widely used in analytical chemistry to interpret complex multivariate data. However, their reliability depends on robust validation, which, when no independent test set is available, typically involves splitting the dataset into calibration and validation subsets. Common splitting methods such as random splitting, Kennard-Stone, SPXY, and Onion each have limitations, including biased performance estimates or unbalanced subset distributions. This study introduces XYOnion, a novel algorithm that addresses these issues by combining the strengths of SPXY and Onion in a unified framework.
Results
XYOnion generates representative calibration and validation subsets by combining distances in both the predictor (X) and response (y) spaces, and assigning samples in layered shells based on this combined metric. This layered structure ensures balanced coverage across the entire data space and prevents extrapolation in the validation subset, leading to more robust model assessments. The algorithm was tested on both simulated and real datasets and systematically compared against commonly used methods such as random splitting, Kennard-Stone, SPXY, and Onion. Results demonstrate that XYOnion produces more realistic and stable figures of merit, effectively avoiding the overly optimistic performance estimates that arise from unbalanced or non-representative splits. Additionally, the incorporation of the DISTSLCT algorithm improves computational efficiency by eliminating the need to compute full pairwise distance matrices, thereby enhancing scalability and making XYOnion suitable for large and high-dimensional datasets encountered in chemometric applications.
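The idea of a combined X/y distance with layered assignment can be illustrated with a minimal Python sketch. This is not the published XYOnion implementation, and it does not implement DISTSLCT. The function name `xy_layer_split`, the parameters `n_layers` and `cal_fraction`, the use of centroid distances, and the rule for dividing each shell are all assumptions made here for illustration only.

```python
import numpy as np


def xy_layer_split(X, y, n_layers=5, cal_fraction=0.7):
    """Illustrative layered split on a combined X/y distance.

    Hypothetical sketch, not the published XYOnion code.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)

    # Distance of every sample to the centroid, in each space separately.
    # Assumption: centroid distances stand in for the paper's metric; they
    # need n distances per space rather than a full n-by-n pairwise matrix.
    dx = np.linalg.norm(X - X.mean(axis=0), axis=1)
    dy = np.linalg.norm(y - y.mean(axis=0), axis=1)

    # SPXY-style combination: scale each distance by its maximum, then add,
    # so that X and y contribute on a comparable footing.
    d = dx / (dx.max() or 1.0) + dy / (dy.max() or 1.0)

    # Sort from outermost to innermost and cut into shells of similar size.
    order = np.argsort(d)[::-1]
    cal, val = [], []
    for shell in np.array_split(order, n_layers):
        # At least one sample per non-empty shell goes to calibration.
        n_cal = max(1, int(round(cal_fraction * len(shell))))
        # Outer part of each shell -> calibration, inner part -> validation,
        # so the most extreme sample overall is always a calibration sample.
        cal.extend(shell[:n_cal])
        val.extend(shell[n_cal:])
    return np.array(cal, dtype=int), np.array(val, dtype=int)
```

Because every shell contributes to both subsets, calibration and validation samples are spread from the edge of the data cloud to its centre, and placing the outermost samples in calibration keeps the validation subset inside the calibrated range of the combined distance.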
Significance and novelty
In conclusion, XYOnion offers a practical and reliable approach for dataset splitting in situations where independent test sets are unavailable. By integrating distance information from both predictor and response variables with a layered sampling strategy, the method ensures balanced and representative calibration and validation subsets. This leads to more realistic, reproducible, and trustworthy model evaluations.
About the journal:
Analytica Chimica Acta has an open access mirror journal Analytica Chimica Acta: X, sharing the same aims and scope, editorial team, submission system and rigorous peer review.
Analytica Chimica Acta provides a forum for the rapid publication of original research, and critical, comprehensive reviews dealing with all aspects of fundamental and applied modern analytical chemistry. The journal welcomes the submission of research papers which report studies concerning the development of new and significant analytical methodologies. In determining the suitability of submitted articles for publication, particular scrutiny will be placed on the degree of novelty and impact of the research and the extent to which it adds to the existing body of knowledge in analytical chemistry.