XYOnion: a layer-based method for splitting datasets into calibration and validation subsets

IF 5.7 · Zone 2 (Chemistry) · Q1 Chemistry, Analytical

Analytica Chimica Acta Pub Date : 2025-05-21 DOI:10.1016/j.aca.2025.344229

Jokin Ezenarro

Analytica Chimica Acta, volume 1364, Article 344229. Published 2025-05-21. Not open access. https://www.sciencedirect.com/science/article/pii/S0003267025006233

Citations: 0

Abstract

Background

Chemometric models are widely used in analytical chemistry to interpret complex multivariate data. However, they rely on robust validation, which typically involves splitting the dataset into calibration and validation subsets when an independent set is missing. Dataset splitting methods such as random splitting, Kennard-Stone, SPXY, and Onion each have limitations, including biased performance estimates or unbalanced subset distributions. This study introduces XYOnion, a novel algorithm that addresses these issues by combining the strengths of SPXY and Onion in a unified framework.
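For orientation, two of the baseline methods named above can be sketched in a few lines. The snippet is a minimal illustration and not code from the paper: `spxy_distance` and `kennard_stone` are hypothetical helper names, Euclidean distances are assumed, and the full pairwise distance matrix is built explicitly, so it only suits small datasets.

```python
import numpy as np

def spxy_distance(X, y):
    """Pairwise SPXY distance: X and y distances, each scaled by its maximum.

    Assumes at least two distinct samples in both X and y (no zero maximum).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    dx = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dy = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=2)
    return dx / dx.max() + dy / dy.max()

def kennard_stone(D, n_select):
    """Pick n_select (>= 2) sample indices by Kennard-Stone on a distance matrix D."""
    # Start from the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [int(i), int(j)]
    # Distance from every sample to its nearest already-selected sample.
    min_dist = np.minimum(D[i], D[j])
    while len(selected) < n_select:
        min_dist[selected] = -np.inf        # never pick a sample twice
        k = int(np.argmax(min_dist))        # farthest from the selected set
        selected.append(k)
        min_dist = np.minimum(min_dist, D[k])
    return selected

# Usage: Kennard-Stone on the joint X-y distance is the SPXY selection.
X_demo = np.array([[0.0], [1.0], [2.0], [10.0]])
y_demo = np.array([0.0, 1.0, 2.0, 10.0])
calibration_idx = kennard_stone(spxy_distance(X_demo, y_demo), 3)
```

Passing a plain X-only distance matrix to `kennard_stone` gives the classical Kennard-Stone selection; the samples left unselected would form the validation subset.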

Results

XYOnion generates representative calibration and validation subsets by combining distances in both the predictor (X) and response (y) spaces, and assigning samples in layered shells based on this combined metric. This layered structure ensures balanced coverage across the entire data space and prevents extrapolation in the validation subset, leading to more robust model assessments. The algorithm was tested on both simulated and real datasets and systematically compared against commonly used methods such as random splitting, Kennard-Stone, SPXY, and Onion. Results demonstrate that XYOnion produces more realistic and stable figures of merit, effectively avoiding the overly optimistic performance estimates that arise from unbalanced or non-representative splits. Additionally, the incorporation of the DISTSLCT algorithm improves computational efficiency by eliminating the need to compute full pairwise distance matrices, thereby enhancing scalability and making XYOnion suitable for large and high-dimensional datasets encountered in chemometric applications.
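The idea of ranking samples by a combined X-y metric and assigning them in shells can be sketched as follows. This is only an illustration under stated assumptions, not the published XYOnion procedure: the abstract does not give the layer rule or the details of DISTSLCT, so the function name `layered_split`, the centroid-distance ranking, the SPXY-style scaling, and the outer-to-calibration, inner-to-validation rule within each layer are all hypothetical choices. Ranking by distance to the centroid avoids a pairwise distance matrix, but it is a stand-in and not DISTSLCT.

```python
import numpy as np

def layered_split(X, y, n_layers=3, val_fraction=0.3):
    """Hypothetical layered split; returns (calibration_idx, validation_idx).

    Illustrative assumption only, not the algorithm from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(len(X), -1)
    # Distance of every sample to the centroid, separately in X and in y.
    dx = np.linalg.norm(X - X.mean(axis=0), axis=1)
    dy = np.linalg.norm(y - y.mean(axis=0), axis=1)
    # Scale each by its maximum so both spaces contribute comparably
    # (assumes the samples are not all identical in X or in y).
    score = dx / dx.max() + dy / dy.max()
    order = np.argsort(-score, kind="stable")   # outermost samples first
    cal, val = [], []
    for layer in np.array_split(order, n_layers):
        n_val = int(round(val_fraction * len(layer)))
        n_cal = len(layer) - n_val
        cal.extend(layer[:n_cal].tolist())   # outer part of the layer: calibration
        val.extend(layer[n_cal:].tolist())   # inner part of the layer: validation
    return np.array(cal), np.array(val)

# Usage on synthetic data (values are arbitrary).
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.normal(size=(30, 5))
y_demo = X_demo[:, 0] + 0.1 * rng_demo.normal(size=30)
cal_idx, val_idx = layered_split(X_demo, y_demo)
```

Because the outermost samples always land in the calibration subset, every validation sample in this sketch lies inside the range of the combined score covered by calibration, which mirrors the stated aim of preventing extrapolation in the validation subset.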

Significance and novelty

In conclusion, XYOnion offers a practical and reliable approach for dataset splitting in situations where independent test sets are unavailable. By integrating distance information from both predictor and response variables with a layered sampling strategy, the method ensures balanced and representative calibration and validation subsets. This leads to more realistic, reproducible, and trustworthy model evaluations.

Source journal

Analytica Chimica Acta (Chemistry: Analytical Chemistry)

CiteScore: 10.40

Self-citation rate: 6.50%

Articles published: 1081

Review time: 38 days

About the journal: Analytica Chimica Acta has an open access mirror journal Analytica Chimica Acta: X, sharing the same aims and scope, editorial team, submission system and rigorous peer review. Analytica Chimica Acta provides a forum for the rapid publication of original research, and critical, comprehensive reviews dealing with all aspects of fundamental and applied modern analytical chemistry. The journal welcomes the submission of research papers which report studies concerning the development of new and significant analytical methodologies. In determining the suitability of submitted articles for publication, particular scrutiny will be placed on the degree of novelty and impact of the research and the extent to which it adds to the existing body of knowledge in analytical chemistry.