Being Aware of Data Leakage and Cross-Validation Scaling in Chemometric Model Validation
Péter Király, Gergely Tóth
Journal of Chemometrics, 39(4), 2025. DOI: 10.1002/cem.70026
https://onlinelibrary.wiley.com/doi/10.1002/cem.70026

Abstract

Chemometrics is one of the most elaborated fields of data science. For several decades it has pioneered, and continues to pioneer, the use of novel machine learning methods. The literature on chemometric modeling is enormous; there are many guidance documents, software packages, and other descriptions of how to perform a careful analysis. On the other hand, the literature is often contradictory and inconsistent. In many studies, results obtained on specific datasets are generalized without justification, and the generalized idea is later cited without the original limitations. In some cases, differences in the nomenclature of methods cause misinterpretation. As in every field of science, some methodological preferences rest on the strength of particular research groups rather than on a flexible, genuinely scientific selection among the possibilities. There is also some inconsistency between the practical approach of chemometrics and theoretical statistics, where often unrealistic assumptions and limits are studied.

The widely elaborated know-how of chemometrics brings some rigidity to the field. There are trends in data science to which chemometrics adapts slowly. One example is thinking exclusively within the bias-variance trade-off during model building [1] instead of using models in the double-descent region for large datasets [2-4]. Another problematic issue is data leakage: to this day, chemometric models are frequently built and validated on datasets suffering from data leakage.

In our investigations, we have met cases where the huge literature background created large inertia against the correction of misinterpretations. In 2021, we found that leave-one-out and leave-many-out cross-validation (LMO-CV) parameters can be scaled onto each other [5]. Furthermore, we showed that the two approaches have roughly the same uncertainty in multiple linear regression (MLR) calculations [6]. The choice between these methods should therefore be driven by computational practice rather than by preconceptions. We received some formal and informal criticism for omitting the results of some well-cited studies.

In this article, we present examples intended to encourage the rethinking of some traditional solutions in chemometrics. We show, with calculations, how data leakage arises in chemometric tasks. Our other calculations focus on the scaling law, in order to rehabilitate leave-one-out cross-validation.

In machine learning, data leakage means the use of information during model building that biases the assessment of the model's predictions, or that will not be available when the model is applied predictively in practice. A typical, easily detected example is when cases very similar to training cases are present in the test set. A different form of leakage occurs when the explanatory variables contain variables or class labels that are too closely related to the response variables. Data leakage causes problems in model performance assessment similar to those caused by overfitting, but their definitions and the origins of the validation difficulties differ. They can emerge independently; all combinations are possible, for example strong overfitting without data leakage, or no overfitting with strong data leakage. Their common effect is to reduce the effectiveness of model validation, except in the limit of infinite training and test set sizes for an optimally complex model, where the effect of both data leakage and overfitting on the performance parameters goes to zero.

The classical leakage of cases occurs between the training set and the test set. An optimal test set is never used in the training process or in decisions concerning model selection or hyperparameter optimization. The test set should be representative of the intended application of the model. If the dataset is large enough to be split into training and test sets, and the latter represents the intended application field well, the test set can be selected from the existing dataset before model building starts. If the intended applications will show large variability that the starting dataset does not contain, an optimal test set (or sets) should be obtained in new measurement campaigns. The independent test set can thus be obtained either by splitting the data before modeling starts or later from new measurements. The sampling may follow two recipes: simple statistical sampling, where no predictor or response ranges are preferred during selection, or a designed selection based on one of the theories of sampling. These possibilities are detailed, for example, in Ref. [7].

For models with hyperparameters, the simplest train/test split is not enough. It is at least necessary to divide the training set into temporary training and validation sets [26]. In the simplest procedure, a model with given hyperparameters is parametrized on the temporary training set and assessed on the validation set. The selection among the differently hyperparametrized models is based on their performance on the validation set. The final model is usually reparametrized on the aggregate of the temporary training and validation sets. Data leakage between the temporary training and validation sets arises mostly at this aggregation step, and it inherently biases model selection.

There may be further leakage in hyperparameter optimization if validation parameters are used there. If a given validation parameter is used to select the hyperparameters, that parameter becomes overoptimistic for the final model in comparison to the other validation parameters. We may call this effect parameter leakage. Parameter leakage may also be present in variable selection.

The OECD QSAR Guidance [8] categorizes validation processes as internal and external. Internal validation means calculating validation parameters on the performance of the model using data that were used in model building and model selection. External validation means calculating validation parameters on a test set (an optimal test set, as defined above). The single purpose of external validation should be to assess the predictivity of the final model. The aims of internal validation are to assess the goodness of fit of the model on the training set and the robustness of the model; the latter is mostly handled by cross-validation methods, or sometimes by bootstrap. The OECD guidance does not detail how hyperparameter optimization should be performed. The use of cross-validation does not remove the necessity of external (test) validation. Cross-validation has its role in hyperparameter optimization, especially where it is impossible to set aside a validation set for this task. Furthermore, for small datasets, cross-validation is also an approximate tool for estimating predictivity when an independent test set is unavailable. In any case, we have to keep in mind the verdict of Ref. [7] that "cross-validation is only a suboptimal simulation of test set validation."

The OECD guidance is seldom taken into account in all of its parts, especially the requirement of an independent test set. On the contrary, a surprising amount of emphasis has been put on how to perform cross-validation in the chemometric literature of the last three decades [9-12, 25]. Different tasks are listed there, such as model and hyperparameter selection and variable selection, and cross-validation is often used to obtain a realistic estimate of the "predictive" power of a model. Contrary to the clear trend in data science, there is a debate in chemometrics over whether predictive power can be determined using cross-validation alone [7, 13-19]. Among the many cross-validation procedures, the repeated and double cross-validation methods provide stable validation parameters, despite the data leakage they contain [20, 21, 12]. In double (sometimes called nested) cross-validation schemes, one of the iterations usually splits the data into a "test" set and validation + temporary training sets, but on closer inspection these "test" sets do not fulfill the leakage-free requirement mentioned earlier. Some developers justify the lack of a real external test set with the idea that "one test set is not a test set," owing to the high variance of validation parameters calculated on a single set [23, 24]. Nevertheless, the optimal solution is one large test set exhibiting all the variability of the intended applications. If that is not possible, a good solution is several independent test sets, for example from different measurement campaigns, to exemplify the variability of later applications.

There are three main nonnested cross-validation methods. A mismatch in their names causes misunderstanding about their power. In our previous studies, we followed the naming convention of the OECD Guidance [8].

We call the following parameter calculation process leave-one-out cross-validation (LOO-CV): if n_train cases are used to optimize the basic model parameters, models are developed on n_train - 1 observations. Altogether, n_train models are calculated, so that each case is omitted exactly once. The validation parameters are calculated on all n_train cases, but for each case only the predicted value from the model in which that case was not used in training is taken.

We call LMO-CV the calculation process (per OECD) in which a similar approach is applied as in LOO-CV, but m cases are omitted at a time. Altogether, n_train/m models are built, such that each case is omitted exactly once. The validation parameter is again calculated on all n_train cases, using for each case only the predicted value from the model in which that case was not used in training. This method is sometimes called m-fold cross-validation.

The third nonnested cross-validation splits the training set into a validation set of n_v elements and a distinct set of n_c = n_train - n_v cases on which the temporary model is trained. The validation parameter is calculated on the n_v set. Usually the split into n_v and n_c is repeated several times, and the validation parameters are averaged over the repetitions. We call this procedure repeated cross-validation (REP-CV). In the literature, several authors call it leave-multiple-out cross-validation, or LMO-CV. In this study, we show results for these three cases, denoting them LOO-CV, LMO-CV, and REP-CV.

Note that some authors call our LMO-CV m-fold cross-validation. REP-CV is referred to as hold-out [7], leave-n_v-out [25], leave-multiple-out [12], leave-p-out (e.g., on Wikipedia), and leave-M-out [7] cross-validation, depending on the number of repetitions of the split, from one up to the maximal number of combinatorial possibilities. A final version, not used by us, is Monte Carlo cross-validation [7], where several splits are applied with nearly random n_v values and repetition numbers.

We used three modeling methods in our study: MLR, partial least squares regression (PLS), and artificial neural networks (ANN). Unless detailed otherwise in the text, the hyperparameters of the PLS and ANN models were identical to those in our previous studies [5, 6, 26]. The ANN models contained an input layer, a single hidden layer, and an output layer. The optimization was mostly performed with the Adam algorithm [27], and logistic or hyperbolic tangent activation functions were used.

We used datasets collected from the literature, most of which we had used previously. Details of the datasets can be found in Table 1. The aim of this article is to show selected cases of data/parameter leakage and some calculations related to our LOO-CV/LMO-CV scaling law. We were interested in demonstrating the existence of these effects, not in quantifying them in general; therefore, we did not perform massive serial calculations on many datasets. We could have shown the effects on simulated data, where datasets can be created free of any peculiarity beyond the feature required to demonstrate, for example, data leakage. However, trends based on simulated data are persistently suspected of having been simply built in during the data simulation process. To avoid this, we concentrate here on one dataset selected from the literature for each topic discussed. We do not report the details, but for each case short calculations were also performed on other datasets, and we detected the effects there as well.

In most cases, we show the sample size dependence of the targeted behavior. Random samples were taken from the original datasets, repeated 100-500 times for each sample size. Random train/test splitting was applied with an 80/20 ratio, yielding n_train and n_test cases in the respective sets. In these basic calculations, both the sampling and the splitting can be classified as statistical sampling, without the conscious use of a theory of sampling or design of experiments, for example, to select representative subsets. The different performance parameters were calculated on all samples, and mostly their median values are shown in the figures. We have previously applied this scheme of plotting the medians of repeated model building against the training set size to reveal clear trends in validation, and we consider it more plausible and reliable than reporting only a few values for one given sample size. In this article, we did not focus on the allocation of the predictor values as detailed, for example, in design of experiments; the one exception is Section 3.2.

The calculations were performed with R and Python code, using, for example, the reticulate and scikit-learn packages [35-37].

Two groups of validation parameters were calculated: the extensive root mean square error (RMSE) group and the intensive coefficient of determination (R2) group. The parameters are defined in Table 2.

Our investigations focused on aspects of data evaluation that have emerged in recent years. We provided example calculations on two topics: data leakage, and the scaling of leave-one-out and LMO-CV parameters.

Our calculations show that it is easy to introduce data leakage by inappropriate train/test splitting. Our leakage-free reference was random splitting. We showed that the Kennard-Stone method introduces significant data leakage between the test and training sets. For stratified sampling, we detected small changes compared to random sampling, but these changes neither improved nor worsened the models significantly and did not falsify the validation; it may be a good choice if one intends to apply some theory of sampling beyond simple random statistical sampling. We also showed that with repeated-measurement data, care must be taken to put all repetitions of the same object either into the test set or into the training set.

We also show examples of the leakage we call parameter leakage. In the first situation, the hyperparameters of a model were optimized with respect to a given cross-validation parameter; in the second, variable selection was optimized with respect to a given cross-validation parameter. In both cases, the validation parameter in question became overoptimistic compared to the other validation parameters. We therefore advise putting less emphasis, in the final assessment of a model, on the parameters used in hyperparameter or variable selection; this holds not only for the given parameter but for the whole family of correlated parameters.

We extended our scaling law between leave-one-out and leave-many-out cross-validation parameters to repeated cross-validation parameters. We show on sample size dependence graphs that all these cross-validation parameters can be scaled onto each other by plotting the validation parameters against the number of cases used in the temporary model fits during cross-validation. We previously found that the fluctuations of the leave-one-out and leave-many-out parameters are the same in this graphical representation. For REP-CV, we found slightly larger fluctuations when the number of cross-validation blocks in the repetition equals the number of folds in the leave-many-out process. As a conclusion, we repeat our previous recommendation to always use the computationally cheaper of LOO and LMO.

There are several arguments about the preference between leave-one-out and LMO-CV. One of the most cited articles is Shao's from 1993 [25]. We discuss some misunderstandings in the interpretation of his results: his theoretical results at various data-size limits have little relevance at practical data sizes, and his model calculations comparing leave-many-out and REP-CV parameters do not imply superiority over the leave-one-out method if m-fold cross-validation or a limited number of repeats in REP-CV is applied.

The main goal of our investigation is to raise awareness among chemometricians of how easy it is to introduce data or parameter leakage through inappropriate methods, and to show that precision is necessary when interpreting opinions found in the literature. Furthermore, we show that leave-one-out cross-validation may be preferable to leave-many-out cross-validation in some applications, as explained by the fluctuation intervals shown in our scaled LOO-LMO graphs.
Abstract
Chemometrics is one of the most elaborated data science fields. It was pioneering and still as is in the use of novel machine learning methods in several decades. The literature of chemometric modeling is enormous; there are several guidance, software, and other descriptions on how to perform careful analysis. On the other hand, the literature is often contradictory and inconsistent. There are many studies, where results on specific datasets are generalized without justification, and later, the generalized idea is cited without the original limits. In some cases, the difference in the nomenclature of methods causes misinterpretations. As at every field of science, there are also some preferences in the methods which bases on the strength of research groups without flexible and real scientific approach on the selection of the possibilities. There is also some inconsistency between the practical approach of chemometrics and the theoretical statistical theories, where often unrealistic assumptions and limits are studied.
The widely elaborated knowhow of chemometrics brings some rigidity to the field. There are some trends in data science to those ones chemometrics adapts slowly. An example is the exclusive thinking within the bias-variance trade-off model building [1] instead of using models in the double descent region for large datasets [2-4]. Another problematic question is data leakage. Chemometric models are built and often validated on data sets suffering data leakage up to now.
In our investigations, we met cases, where the huge literature background provided large inertia in the correction of misinterpretations. In 2021 we found, that leave-one-out and leave-many-out cross-validation (LMO-CV) parameters can be scaled to each other [5]. Furthermore, we showed that the two ways have around the same uncertainty in multiple linear regression (MLR) calculations [6]. Therefore, the choice among these methods should be the computation practice instead of preconceptions. We obtained some formal and informal criticism about omitting results of some well cited studies.
In this article, we present some examples to enhance rethinking on some traditional solutions in chemometrics. We show some calculations, how data leakage is there in chemometric tasks. Our other calculations focus on the scaling law in order to rehabilitate leave-one-out cross-validation.
In machine learning, data leakage means the use of an information during the model building, which biases the prediction assessment of the model, or will not be available during real predictive application of the model. A typical and easy to detect example is when cases very similar to training ones are present in the test set. There is a different form of leakage, when variables or classes are present in the explanatory variables that are too closely related to the response variables. Data leakage causes problems in model performance assessment similar to overfitting, but their definition and the origin of the difficulties in validation are different. They can emerge independently; all combinations are possible, for example, strong overfitting without data leakage or lack of overfitting with strong data leakage. The common effect is that they reduce the effectiveness of model validation except the case of approaching the infinite limit of training and test set sizes for an optimally complex model. In this limit, the effect of data leakage and overfitting on the performance parameters goes to zero.
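The optimistic bias caused by leaked cases can be demonstrated in a few lines. The sketch below is our own toy example (synthetic data, our variable names, nothing from the article's datasets): an ordinary least-squares model is evaluated once on a disjoint test set and once on cases that were also used for training.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
# Synthetic stand-in data: 60 cases, 5 predictors.
X = rng.normal(size=(60, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=2.0, size=60)

# Leakage-free reference: disjoint 80/20 train/test split.
X_train, y_train = X[:48], y[:48]
X_test, y_test = X[48:], y[48:]
model = LinearRegression().fit(X_train, y_train)
rmse_clean = mean_squared_error(y_test, model.predict(X_test)) ** 0.5

# Leaky evaluation: the "test" cases were also used in training,
# so the error estimate tends to be overoptimistic.
rmse_leaky = mean_squared_error(y_train, model.predict(X_train)) ** 0.5
```

On synthetic data like this the leaky estimate is typically, though not on every random draw, smaller than the honest one.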
The classical leakage of cases is between the training set and the test set. An optimal test set is intended to be never used in the training process or in decisions concerning model selection or hyperparameter optimization. The test set should be representative for the intended application of the model. If the data set is large enough to be split into training and test sets and the latter is a good representative of the intended application field, the test set can be selected from the existing dataset before starting model building. If there will be a large variability in the intended applications and the starting dataset does not have this variability, an optimal test set or test sets should be obtained in new measurement campaigns. So, the independent test set can be obtained by splitting data before starting modeling or can be obtained later in new measurements. The sampling may follow two recipes, simple statistical sampling, when there is no preference of predictor or response ranges during the selection, or they can be designed by using different theories of sampling. The details of these possibilities are detailed, for example, in Ref. [7].
In the case of models having hyperparameters, the simplest train/test splitting is not enough. It is at least necessary to divide the training set into temporary training and validation sets [26]. In the simplest process, a model with given hyperparameters is parametrised on the temporary training set, and it is assessed on the validation set. The selection among the differently hyperparametrised models is based on the model performance on the validation set. The final model is usually reparametrized on the aggregated temporary training and validation sets. Data leakage between the temporary training and the validation sets is caused mostly at the aggregation. It causes a biased model selection in an inherent way.
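A minimal sketch of this split, using scikit-learn on synthetic data (the dataset, the Ridge model, and the alpha grid are our illustrative choices, not the article's):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# The test set is put aside first and never touches model or hyperparameter selection.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# The remaining data are split into temporary training and validation sets.
X_tmp, X_val, y_tmp, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# The hyperparameter (here the Ridge alpha) is selected on the validation set only.
best_alpha, best_rmse = None, np.inf
for alpha in (0.01, 0.1, 1.0, 10.0):
    m = Ridge(alpha=alpha).fit(X_tmp, y_tmp)
    rmse = mean_squared_error(y_val, m.predict(X_val)) ** 0.5
    if rmse < best_rmse:
        best_alpha, best_rmse = alpha, rmse

# Final refit on the aggregated temporary training + validation sets;
# this aggregation is the step where leakage into the validation set arises.
final = Ridge(alpha=best_alpha).fit(X_rest, y_rest)
rmse_test = mean_squared_error(y_test, final.predict(X_test)) ** 0.5
```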
There might be a further leakage in the hyperparameter optimization if one uses validation parameters there. If a given validation parameter is used in the selection of hyperparameters, this validation parameter is going to become overoptimistic in the final model in comparison to the other validation parameters. This effect we might call parameter leakage. This parameter leakage might be present in variable selection, too.
The OECD QSAR Guidance [8] categorizes the validation processes as internal and external ones. Internal means the calculation of validation parameters on the performance of the model using data, which were used in model building and in model selection. External validation means the calculation of validation parameters on a test set (on an optimal test set, as it is defined above). The single purpose of external validation should be to assess predictivity of the final model. The aims of the internal validation are to assess goodness-of-fit of the model on the training set and the robustness of the model. The latter is mostly managed by cross-validation methods or sometimes by bootstrap. The OECD guidance does not go into details about how hyperparameter optimization should be performed. The use of cross-validation does not remove the necessity of external (test) validation. Cross-validation has its role in hyperparameter optimization, especially, where there is no possibility to separate a validation set for this task. Furthermore, in the case of small datasets, cross-validation is also an approximative tool to guess on predictivity if it is impossible to have an independent test set. Anyway, we have to keep in mind the verdict of Ref. [7] that “cross-validation is only a suboptimal simulation of test set validation.”
The OECD guidance is seldomly taken into account in all parts, especially in the requirement of the independent test set. On the contrary, there is surprisingly a lot of emphasis put how to perform cross-validation in the literature of chemometrics in the last three decades [9-12, 25]. Different tasks are listed there, as model and hyperparameter selection and as variable selection, and often, it is used in order to get realistic estimation on the “predictive” power of the model. There is a debate in chemometrics on the contrary to the clear trend in data science that the predictive power could be determined using only cross-validation methods [7, 13-19]. There are several cross-validation procedures, whereof the repeated and double cross-validation methods provide stable validation parameters, despite the data leakage there [20, 21, 12]. In double cross-validation schemes (called sometimes nested ones), one of the iterations usually splits the data in “test” and validation+temporary training sets, but going into the details, one can find that these “test” sets do not fulfill the leakage free requirement mentioned earlier. Some developers justify the lack of real external test sets with the idea that “one test set is not a test set” due to the high variance of the validation parameters calculated on a single set [23, 24]. Anyhow, the optimal solution is to have one huge test set showing all variabilities of the intended applications. If it is not possible, a good solution is to have some independent test sets, for example, determined in different measurement campaigns in order to provide examples on the variability of later applications.
There are three main methods in nonnested cross-validation. There is a mismatch in their names causing misunderstanding on their power. In our previous studies, we followed the name convention of the OECD Guidance [8].
We call leave-one-out cross-validation the following parameter calculation process (LOO-CV): if ntrain cases are used to optimize the basic model parameters, models on ntrain-1 observations are developed. Altogether, ntrain models are calculated, where finally, all cases are omitted only once. The validation parameters are calculated on all the ntrain cases, but using for each only that model predicted value which was obtained in the model where the given case was not used in the training process.
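The LOO-CV procedure as defined here can be sketched as follows, on synthetic data with MLR via scikit-learn's LinearRegression (the data and sizes are our own toy choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def loo_cv_predictions(X, y):
    """For each case i, fit on the other n_train - 1 cases and predict case i."""
    n = len(y)
    y_hat = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        model = LinearRegression().fit(X[mask], y[mask])
        y_hat[i] = model.predict(X[i:i + 1])[0]
    return y_hat

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=30)

y_hat = loo_cv_predictions(X, y)                       # 30 models, each case left out once
rmse_cv = float(np.sqrt(np.mean((y - y_hat) ** 2)))    # validation parameter on all 30 cases
```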
We call LMO-CV the calculation process (OECD), where a similar approach is applied as in the case of LOO-CV, but m cases are omitted once. Altogether, ntrain/m models are built in a way that each case is omitted only once. The validation parameter is calculated similarly on all the ntrain cases, but using for each only that model predicted value which was obtained in the model where the given case was not used in the training process. This method is sometimes called m-fold cross-validation.
The third nonnested cross-validation splits the training set into a validation set having nv elements and a distinct nc = ntrain-nv set on which the training of the temporary model is performed. The validation parameter is calculated on the nv set. Usually, the splitting to nv and nc is repeated several times and the validation parameters are averaged on the repetitions. We call this procedure as repeated cross-validation (REP-CV). In the literature, several authors call it leave-multiple-out cross-validation or LMO-CV. In this study, we show results for these three cases denoting them LOO-CV, LMO-CV and REP-CV.
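The remaining two nonnested schemes differ only in how the temporary training sets are formed. A sketch of LMO-CV (m-fold in this text's naming) and REP-CV with scikit-learn's splitters, on synthetic data of our own choosing:

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.4, size=40)

# LMO-CV: each case is left out exactly once, m at a time.
m = 8                                                  # fold size -> n_train/m = 5 models
y_hat = np.empty_like(y)
for train_idx, out_idx in KFold(n_splits=len(y) // m, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    y_hat[out_idx] = model.predict(X[out_idx])
rmse_lmo = float(np.sqrt(np.mean((y - y_hat) ** 2)))

# REP-CV: repeated random splits into n_c training and n_v validation cases;
# the validation parameter is averaged over the repetitions.
rmses = []
for train_idx, val_idx in ShuffleSplit(n_splits=20, test_size=8, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    resid = y[val_idx] - model.predict(X[val_idx])
    rmses.append(float(np.sqrt(np.mean(resid ** 2))))
rmse_rep = float(np.mean(rmses))
```

In the LMO loop every case receives exactly one held-out prediction; in the REP loop the same case may appear in several validation sets or in none.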
Please take care that for some authors, LMO-CV is called m-fold cross-validation. REP-CV is mentioned as hold-out [7], leave-nv-out [25], leave-multiple-out [12], leave-p-out (e.g., at Wikipedia) and leave M-out [7] cross validation with respect to the repetition number of the split from one to the maximal combinatoric possibilities. There is a last version not used by us, called Monte Carlo [7], where several splitting applied with close to random nv-s and repetition numbers.
We used three modeling methods in our study: MLR, partial least squares regression (PLS), and artificial neural network (ANN). If it is not detailed in the text, the hyperparameters of the PLS and ANN models were identical to our previous studies [5, 6, 26]. The ANN models contained an input, a single hidden and an output layer. The optimization was mostly performed by the Adam algorithm [27], and logistic or tangent hyperbolic activation functions were used.
We used datasets collected from the literature and mostly used by us previously. The details of the datasets can be found in Table 1. The aim of this article is to show some selected case of data/parameter leakage and some calculations related to our LOO-CV/LMO-CV scaling law. We were interested in showing the existence of the effects, and we were not interested to quantify in general these effects. Therefore, we did not perform massive serial calculation on several datasets. We might show the effects on simulated data. In that case, one might create data sets having free of any other specialities over the required feature in the demonstration of, for example, data leakage. On the contrary, there is a persistent suspect on trends based on simulated data that the results were simple built in during the data simulation process. In order to avoid it, here, we concentrate on one dataset selected from the literature for each discussed topic. We do not detail, but for each case, there were some short calculations on other datasets, and we detected the effects there as well.
In most cases, we show the sample size dependence of the targeted behavior. Random samples were taken from the original datasets, repeated 100–500 times for each sample size. Random train/test splitting was applied with an 80/20 ratio, resulting in ntrain and ntest cases in the respective sets. In these basic calculations, both the sampling and the splitting can be classified as statistical sampling, without consciously applying the theory of sampling or design of experiments, for example, to select representative subsets for the models. The different performance parameters were calculated on all of the samples, and their median values are mostly shown in the figures. We have successfully applied this scheme of showing medians of repeated model building versus the training set size previously, in order to reveal clear trends in validation. We consider this method more plausible and reliable than showing only a few values corresponding to one given sample size. In this article, we did not focus on the allocation of the predictor values detailed, for example, in design of experiments; the one exception is Section 3.2.
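Our reading of this repeated-sampling scheme can be sketched as follows; the synthetic data, the model (MLR), and the repeat count are stand-ins chosen only to make the loop structure concrete.

```python
# Sketch of the repeated-sampling scheme: for a given sample size, draw
# random subsets from the full dataset, split each 80/20, fit a model,
# and report the median test RMSE over the repetitions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_full = rng.normal(size=(500, 5))                      # stand-in dataset
y_full = X_full @ np.ones(5) + rng.normal(size=500)

def median_rmse(n, repeats=100):
    rmses = []
    for r in range(repeats):
        idx = rng.choice(len(X_full), size=n, replace=False)  # random sample
        Xtr, Xte, ytr, yte = train_test_split(                # 80/20 split
            X_full[idx], y_full[idx], test_size=0.2, random_state=r)
        model = LinearRegression().fit(Xtr, ytr)
        rmses.append(mean_squared_error(yte, model.predict(Xte)) ** 0.5)
    return float(np.median(rmses))

print(round(median_rmse(50), 2))
```

Evaluating `median_rmse` over a grid of sample sizes yields the median-versus-training-set-size curves shown in the figures.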
The calculations were performed with R and Python codes, relying, for example, on the reticulate and scikit-learn packages [35-37].
Two groups of validation parameters were calculated: the extensive root mean square error (RMSE) group and the intensive coefficient of determination (R2) group. The parameters are defined in Table 2.
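For concreteness, the two parameter families reduce to the following textbook formulas (the exact variants we computed are those of Table 2):

```python
# Minimal definitions of the two validation-parameter families:
# the extensive RMSE and the intensive coefficient of determination.
import numpy as np

def rmse(y, yhat):
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return float(np.sqrt(np.mean((y - yhat) ** 2)))   # same units as y

def r2(y, yhat):
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    ss_res = np.sum((y - yhat) ** 2)                  # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)              # total sum of squares
    return float(1.0 - ss_res / ss_tot)               # dimensionless

print(rmse([1, 2, 3], [1, 2, 3]))  # 0.0
print(r2([1, 2, 3], [1, 2, 3]))    # 1.0
```

RMSE scales with the magnitude of the response, while R2 is dimensionless, which is why we refer to the groups as extensive and intensive, respectively.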
Our investigations focused on aspects of data evaluation that have emerged in recent years. We provided example calculations on two topics: data leakage and the scaling of leave-one-out and LMO-CV parameters.
Our calculations show that it is easy to introduce data leakage by inappropriate train/test splitting. Our leakage-free reference was random splitting. We showed that the Kennard–Stone method introduces significant data leakage between the test and training sets. In the case of stratified sampling, we detected small changes compared to random sampling, but these changes did not significantly improve or worsen the models and did not distort the validation; stratified sampling might be a good choice if one intends to include some theory of sampling beyond simple random statistical sampling. We also showed that, for repeated measurements, one should take care to put all repetitions of the same object either into the test set or into the training set.
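The last recommendation, keeping all repetitions of one object on the same side of the split, corresponds to a group-aware splitter; a minimal sketch with scikit-learn's GroupShuffleSplit, where the object identifiers and repetition counts are invented for the example:

```python
# Sketch: repeated measurements of the same object must not straddle
# the train/test boundary; group-aware splitting enforces this.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(20), 3)     # 20 objects, 3 repetitions each
X = rng.normal(size=(60, 4))             # one row per measurement

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, groups=groups))

# No object appears on both sides of the split.
overlap = set(groups[train_idx]) & set(groups[test_idx])
print(len(overlap))  # 0
```

A plain row-wise random split would almost certainly place some repetitions of an object in the training set and others in the test set, producing exactly the leakage described above.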
We show examples of a kind of leakage that we call parameter leakage. In the first situation, the hyperparameters of a model were optimized with respect to a given cross-validation parameter; in the second, variable selection was optimized with respect to a given cross-validation parameter. Afterwards, these validation parameters became overoptimistic compared to other validation parameters. We therefore warn to put less emphasis, in the final assessment of the model, on the parameters used in hyperparameter optimization or variable selection. This holds not only for the given parameter but for the whole family of parameters that correlate with it.
We extended our scaling law between leave-one-out and leave-many-out cross-validation parameters to repeated cross-validation parameters. We show on sample size dependence graphs that all these different cross-validation parameters can be scaled to each other by plotting the validation parameters against the number of cases used in the temporary model fits of the cross-validation procedure. We found previously that the fluctuation of the leave-one-out and leave-many-out parameters is the same in this graphical representation. In the case of REP-CV, we found slightly larger fluctuations if the number of cross-validation blocks in the repetition equals the number of folds in the leave-many-out process. In conclusion, we repeat our previous recommendation to always use the computationally cheaper of LOO and LMO.
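The scaling coordinate can be made explicit in a small sketch: each variant's temporary models are fitted on n-1 cases for LOO and roughly n(1-1/k) cases for k-fold LMO, and it is this fit size, not the nominal sample size, that aligns the curves. The data and model below are stand-ins for demonstration.

```python
# Sketch of the scaling coordinate: RMSECV paired with the number of
# cases used in the temporary model fits (n-1 for LOO, n(1-1/k) for k-fold).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
n = 60
X = rng.normal(size=(n, 4))
y = X @ np.ones(4) + 0.3 * rng.normal(size=n)

def rmsecv(cv):
    pred = cross_val_predict(LinearRegression(), X, y, cv=cv)
    return float(np.sqrt(np.mean((y - pred) ** 2)))

print(n - 1, round(rmsecv(LeaveOneOut()), 3))          # LOO: fits on 59 cases
print(int(n * (1 - 1 / 5)), round(rmsecv(KFold(5)), 3))  # 5-fold: fits on 48 cases
```

Plotting each RMSECV at its temporary-fit size rather than at n is what collapses the LOO, LMO, and REP-CV curves onto one another.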
There are several arguments concerning the preference for leave-one-out or LMO-CV. One of the most cited articles is Shao's from 1993 [25]. We discuss some misunderstandings in the interpretation of his results and argue that his theoretical results, obtained at certain data limits, have little relevance at practical data sizes. We show that his model calculations comparing leave-many-out and REP-CV parameters do not imply superiority over the leave-one-out method if m-fold cross-validation or a limited number of repeats in REP-CV are applied.
The main goal of our investigation is to raise awareness among chemometricians of how easy it is to introduce data or parameter leakage by inappropriate methods, and to show that care is necessary in interpreting opinions found in the literature. Furthermore, we show that leave-one-out cross-validation might be preferable to leave-many-out cross-validation in some applications, which can be explained by the fluctuation intervals shown in our scaled LOO-LMO graphs.
Journal overview:
The Journal of Chemometrics is devoted to the rapid publication of original scientific papers, reviews and short communications on fundamental and applied aspects of chemometrics. It also provides a forum for the exchange of information on meetings and other news relevant to the growing community of scientists who are interested in chemometrics and its applications. Short, critical review papers are a particularly important feature of the journal, in view of the multidisciplinary readership at which it is aimed.