Being Aware of Data Leakage and Cross-Validation Scaling in Chemometric Model Validation

Péter Király, Gergely Tóth
{"title":"Being Aware of Data Leakage and Cross-Validation Scaling in Chemometric Model Validation","authors":"Péter Király,&nbsp;Gergely Tóth","doi":"10.1002/cem.70026","DOIUrl":null,"url":null,"abstract":"<p>Chemometrics is one of the most elaborated data science fields. It was pioneering and still as is in the use of novel machine learning methods in several decades. The literature of chemometric modeling is enormous; there are several guidance, software, and other descriptions on how to perform careful analysis. On the other hand, the literature is often contradictory and inconsistent. There are many studies, where results on specific datasets are generalized without justification, and later, the generalized idea is cited without the original limits. In some cases, the difference in the nomenclature of methods causes misinterpretations. As at every field of science, there are also some preferences in the methods which bases on the strength of research groups without flexible and real scientific approach on the selection of the possibilities. There is also some inconsistency between the practical approach of chemometrics and the theoretical statistical theories, where often unrealistic assumptions and limits are studied.</p><p>The widely elaborated knowhow of chemometrics brings some rigidity to the field. There are some trends in data science to those ones chemometrics adapts slowly. An example is the exclusive thinking within the bias-variance trade-off model building [<span>1</span>] instead of using models in the double descent region for large datasets [<span>2-4</span>]. Another problematic question is data leakage. Chemometric models are built and often validated on data sets suffering data leakage up to now.</p><p>In our investigations, we met cases, where the huge literature background provided large inertia in the correction of misinterpretations. In 2021 we found, that leave-one-out and leave-many-out cross-validation (LMO-CV) parameters can be scaled to each other [<span>5</span>]. Furthermore, we showed that the two ways have around the same uncertainty in multiple linear regression (MLR) calculations [<span>6</span>]. Therefore, the choice among these methods should be the computation practice instead of preconceptions. We obtained some formal and informal criticism about omitting results of some well cited studies.</p><p>In this article, we present some examples to enhance rethinking on some traditional solutions in chemometrics. We show some calculations, how data leakage is there in chemometric tasks. Our other calculations focus on the scaling law in order to rehabilitate leave-one-out cross-validation.</p><p>In machine learning, data leakage means the use of an information during the model building, which biases the prediction assessment of the model, or will not be available during real predictive application of the model. A typical and easy to detect example is when cases very similar to training ones are present in the test set. There is a different form of leakage, when variables or classes are present in the explanatory variables that are too closely related to the response variables. Data leakage causes problems in model performance assessment similar to overfitting, but their definition and the origin of the difficulties in validation are different. They can emerge independently; all combinations are possible, for example, strong overfitting without data leakage or lack of overfitting with strong data leakage. 
The common effect is that they reduce the effectiveness of model validation except the case of approaching the infinite limit of training and test set sizes for an optimally complex model. In this limit, the effect of data leakage and overfitting on the performance parameters goes to zero.</p><p>The classical leakage of cases is between the training set and the test set. An optimal test set is intended to be never used in the training process or in decisions concerning model selection or hyperparameter optimization. The test set should be representative for the intended application of the model. If the data set is large enough to be split into training and test sets and the latter is a good representative of the intended application field, the test set can be selected from the existing dataset before starting model building. If there will be a large variability in the intended applications and the starting dataset does not have this variability, an optimal test set or test sets should be obtained in new measurement campaigns. So, the independent test set can be obtained by splitting data before starting modeling or can be obtained later in new measurements. The sampling may follow two recipes, simple statistical sampling, when there is no preference of predictor or response ranges during the selection, or they can be designed by using different theories of sampling. The details of these possibilities are detailed, for example, in Ref. [<span>7</span>].</p><p>In the case of models having hyperparameters, the simplest train/test splitting is not enough. It is at least necessary to divide the training set into temporary training and validation sets [<span>26</span>]. In the simplest process, a model with given hyperparameters is parametrised on the temporary training set, and it is assessed on the validation set. The selection among the differently hyperparametrised models is based on the model performance on the validation set. The final model is usually reparametrized on the aggregated temporary training and validation sets. Data leakage between the temporary training and the validation sets is caused mostly at the aggregation. It causes a biased model selection in an inherent way.</p><p>There might be a further leakage in the hyperparameter optimization if one uses validation parameters there. If a given validation parameter is used in the selection of hyperparameters, this validation parameter is going to become overoptimistic in the final model in comparison to the other validation parameters. This effect we might call parameter leakage. This parameter leakage might be present in variable selection, too.</p><p>The OECD QSAR Guidance [<span>8</span>] categorizes the validation processes as internal and external ones. Internal means the calculation of validation parameters on the performance of the model using data, which were used in model building and in model selection. External validation means the calculation of validation parameters on a test set (on an optimal test set, as it is defined above). The single purpose of external validation should be to assess predictivity of the final model. The aims of the internal validation are to assess goodness-of-fit of the model on the training set and the robustness of the model. The latter is mostly managed by cross-validation methods or sometimes by bootstrap. The OECD guidance does not go into details about how hyperparameter optimization should be performed. 
The use of cross-validation does not remove the necessity of external (test) validation. Cross-validation has its role in hyperparameter optimization, especially, where there is no possibility to separate a validation set for this task. Furthermore, in the case of small datasets, cross-validation is also an approximative tool to guess on predictivity if it is impossible to have an independent test set. Anyway, we have to keep in mind the verdict of Ref. [<span>7</span>] that “cross-validation is only a suboptimal simulation of test set validation.”</p><p>The OECD guidance is seldomly taken into account in all parts, especially in the requirement of the independent test set. On the contrary, there is surprisingly a lot of emphasis put how to perform cross-validation in the literature of chemometrics in the last three decades [<span>9-12, 25</span>]. Different tasks are listed there, as model and hyperparameter selection and as variable selection, and often, it is used in order to get realistic estimation on the “predictive” power of the model. There is a debate in chemometrics on the contrary to the clear trend in data science that the predictive power could be determined using only cross-validation methods [<span>7, 13-19</span>]. There are several cross-validation procedures, whereof the repeated and double cross-validation methods provide stable validation parameters, despite the data leakage there [<span>20, 21, 12</span>]. In double cross-validation schemes (called sometimes nested ones), one of the iterations usually splits the data in “test” and validation+temporary training sets, but going into the details, one can find that these “test” sets do not fulfill the leakage free requirement mentioned earlier. Some developers justify the lack of real external test sets with the idea that “one test set is not a test set” due to the high variance of the validation parameters calculated on a single set [<span>23, 24</span>]. Anyhow, the optimal solution is to have one huge test set showing all variabilities of the intended applications. If it is not possible, a good solution is to have some independent test sets, for example, determined in different measurement campaigns in order to provide examples on the variability of later applications.</p><p>There are three main methods in nonnested cross-validation. There is a mismatch in their names causing misunderstanding on their power. In our previous studies, we followed the name convention of the OECD Guidance [<span>8</span>].</p><p>We call leave-one-out cross-validation the following parameter calculation process (LOO-CV): if <i>n</i><sub>train</sub> cases are used to optimize the basic model parameters, models on <i>n</i><sub>train</sub>-1 observations are developed. Altogether, <i>n</i><sub>train</sub> models are calculated, where finally, all cases are omitted only once. The validation parameters are calculated on all the <i>n</i><sub>trai<i>n</i></sub> cases, but using for each only that model predicted value which was obtained in the model where the given case was not used in the training process.</p><p>We call LMO-CV the calculation process (OECD), where a similar approach is applied as in the case of LOO-CV, but <i>m</i> cases are omitted once. Altogether, <i>n</i><sub>train</sub><i>/m</i> models are built in a way that each case is omitted only once. 
The validation parameter is calculated similarly on all the <i>n</i><sub>train</sub> cases, but using for each only that model predicted value which was obtained in the model where the given case was not used in the training process. This method is sometimes called <i>m</i>-fold cross-validation.</p><p>The third nonnested cross-validation splits the training set into a validation set having <i>n</i><sub><i>v</i></sub> elements and a distinct <i>n</i><sub><i>c</i></sub> = n<sub>train</sub>-<i>n</i><sub><i>v</i></sub> set on which the training of the temporary model is performed. The validation parameter is calculated on the <i>n</i><sub><i>v</i></sub> set. Usually, the splitting to <i>n</i><sub><i>v</i></sub> and <i>n</i><sub><i>c</i></sub> is repeated several times and the validation parameters are averaged on the repetitions. We call this procedure as repeated cross-validation (REP-CV). In the literature, several authors call it leave-multiple-out cross-validation or LMO-CV. In this study, we show results for these three cases denoting them LOO-CV, LMO-CV and REP-CV.</p><p>Please take care that for some authors, LMO-CV is called m-fold cross-validation. REP-CV is mentioned as hold-out [<span>7</span>], leave-n<sub>v</sub>-out [<span>25</span>], leave-multiple-out [<span>12</span>], leave-<i>p</i>-out (e.g., at Wikipedia) and leave M-out [<span>7</span>] cross validation with respect to the repetition number of the split from one to the maximal combinatoric possibilities. There is a last version not used by us, called Monte Carlo [<span>7</span>], where several splitting applied with close to random <i>n</i><sub>v</sub>-s and repetition numbers.</p><p>We used three modeling methods in our study: MLR, partial least squares regression (PLS), and artificial neural network (ANN). If it is not detailed in the text, the hyperparameters of the PLS and ANN models were identical to our previous studies [<span>5, 6, 26</span>]. The ANN models contained an input, a single hidden and an output layer. The optimization was mostly performed by the Adam algorithm [<span>27</span>], and logistic or tangent hyperbolic activation functions were used.</p><p>We used datasets collected from the literature and mostly used by us previously. The details of the datasets can be found in Table 1. The aim of this article is to show some selected case of data/parameter leakage and some calculations related to our LOO-CV/LMO-CV scaling law. We were interested in showing the existence of the effects, and we were not interested to quantify in general these effects. Therefore, we did not perform massive serial calculation on several datasets. We might show the effects on simulated data. In that case, one might create data sets having free of any other specialities over the required feature in the demonstration of, for example, data leakage. On the contrary, there is a persistent suspect on trends based on simulated data that the results were simple built in during the data simulation process. In order to avoid it, here, we concentrate on one dataset selected from the literature for each discussed topic. We do not detail, but for each case, there were some short calculations on other datasets, and we detected the effects there as well.</p><p>In most cases, we show the sample size dependence of the targeted behavior. Random samples were taken from the original datasets, and it was repeated 100–500 times for each sample size. Random train/test splitting was applied with 80/20 ratio. 
It resulted <i>n</i><sub>train</sub> and <i>n</i><sub>test</sub> cases in the sets. In these basic calculations, both sampling and splitting can be classified as statistical sampling without using consciously theory of sampling or design of experiment, for example, to select representative subsets for models. The different performance parameters were calculated on all of the samples, and their median values are mostly shown in the figures. We successfully applied this scheme of showing medians of repeated model building versus the training set size previously in order to show clear trends in validation. We think this method is more plausible and confident than showing only some values corresponding to one given sample size. In this article, we did not focus on the allocation of the predictor values detailed, for example, in design of experiment. The one exception is Section 3.2.</p><p>The calculations were performed using R and Python codes by using, for example, the reticulate and the scikit-learn packages [<span>35-37</span>].</p><p>Two groups of validation parameters were calculated: the extensive root mean square error (RMSE) group and the intensive coefficient of determination (<i>R</i><sup>2</sup>) ones. The parameters are defined in Table 2.</p><p>Our investigations focused on aspects emerged in the last years in data evaluation. We provided example calculations on two topics: data leakage and scaling of leave-one-out and LMO-CV parameters.</p><p>We show in our calculations that it is easy to introduce data leakage by inappropriate train/test splitting. Our leakage-free reference was random splitting. We showed that the Kennard–Stone method introduces significant data leakage between test and train sets. In the case of stratified sampling, we detected small changes compared to the random sampling, but these changes did not improve or worsen the models significantly and did not false the validation. It might be a good choice if one intends to include some theory of sampling over simple random statistical sampling. We showed that in the case of repeated measurements methods, one should take care to put all repetitions of the same object either into the test or to the training set.</p><p>We show examples of leakage that we call parameter leakage. In the first situation, the hyperparameters of a model were optimized with respect to a given cross-validation parameter. In the second case, variable selection was optimized with respect to a given cross-validation parameter. Thereafter, these validation parameters have become overoptimistic comparing to other validation parameters. Here, we warn to put less emphasis on the parameters used in hyperparameter or variable selection during the final assessment of the model. This is valid not only for the given parameter but for the whole family of parameters which correlate.</p><p>We extended our scaling law between leave-one-out and leave-many out cross-validation parameters to repeated-cross-validation parameters. We show on sample size dependence graphs that all these different cross-validation parameters can be scaled to each other by plotting the validation parameters with respect to the number of cases used in the temporary model fit procedures in cross-validation. We found previously that the fluctuation of the leave-one-out and leave-many-out parameters is the same on this graphical representation. 
In the case of REP-CV, we found slightly larger fluctuations if the number of the cross-validation blocks in the repetition equals to the number of folds in the leave-many-out process. As conclusion, here, we repeat our previous finding to use always the cheaper method of LOO and LMO.</p><p>There are several augmentations on the preference of leave-one-out or LMO-CV. One of the most cited articles is Shao’s from 1993 [<span>25</span>]. We discuss some misunderstanding in the interpretation of his results. We discuss that his theoretical results at different data limits have less relevance at practical data sizes. We show that his model calculations to compare leave-many-out and REP-CV parameters do not imply the superiority over the leave-one-out method if <i>m</i>-fold cross-validation or limited repeats in REP-CV are applied.</p><p>The main goal of our investigation is to raise awareness among chemometricians about how easy it is to introduce data or parameter leakage by inappropriate methods and to show that precision is necessary in the interpretation of opinions found in the literature. Furthermore, we show that leave-one-out cross-validation might be preferential to leave many out cross-validation in some applications. It can be explained by the fluctuation intervals shown in our scaled LOO-LMO graphs.</p>","PeriodicalId":15274,"journal":{"name":"Journal of Chemometrics","volume":"39 4","pages":""},"PeriodicalIF":2.3000,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.70026","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Chemometrics","FirstCategoryId":"92","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/cem.70026","RegionNum":4,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"SOCIAL WORK","Score":null,"Total":0}
引用次数: 0

Abstract

Chemometrics is one of the most elaborated data science fields. It pioneered the use of novel machine learning methods and has remained at the forefront for several decades. The literature on chemometric modeling is enormous; there are numerous guidance documents, software packages, and other descriptions of how to perform a careful analysis. On the other hand, the literature is often contradictory and inconsistent. In many studies, results obtained on specific datasets are generalized without justification, and the generalized idea is later cited without its original limits. In some cases, differences in the nomenclature of methods cause misinterpretations. As in every field of science, some methodological preferences rest on the influence of particular research groups rather than on a flexible, genuinely scientific assessment of the available options. There is also some inconsistency between the practical approach of chemometrics and theoretical statistics, which often studies unrealistic assumptions and limiting cases.

The extensively elaborated know-how of chemometrics also brings some rigidity to the field: there are trends in data science to which chemometrics adapts slowly. One example is thinking exclusively within the bias–variance trade-off framework of model building [1] instead of using models in the double-descent region for large datasets [2-4]. Another problematic issue is data leakage: to this day, chemometric models are built and often validated on datasets suffering from data leakage.

In our investigations, we encountered cases where the huge body of literature created considerable inertia against correcting misinterpretations. In 2021, we found that leave-one-out and leave-many-out cross-validation (LMO-CV) parameters can be scaled onto each other [5]. Furthermore, we showed that the two approaches have roughly the same uncertainty in multiple linear regression (MLR) calculations [6]. Therefore, the choice between these methods should be guided by computational practicality rather than by preconceptions. We received some formal and informal criticism for omitting the results of some well-cited studies.

In this article, we present examples intended to encourage a rethinking of some traditional solutions in chemometrics. We show calculations demonstrating how data leakage arises in chemometric tasks. Our other calculations focus on the scaling law, in order to rehabilitate leave-one-out cross-validation.

In machine learning, data leakage means the use of information during model building that biases the assessment of the model's predictive performance or that will not be available during real predictive application of the model. A typical and easily detected example is when cases very similar to the training ones are present in the test set. A different form of leakage occurs when the explanatory variables contain variables or class labels that are too closely related to the response variables. Data leakage causes problems in model performance assessment similar to overfitting, but their definitions and the origins of the resulting validation difficulties differ. They can emerge independently, and all combinations are possible, for example, strong overfitting without data leakage or no overfitting combined with strong data leakage. Their common effect is that they reduce the effectiveness of model validation, except in the limit of infinitely large training and test sets for an optimally complex model; in this limit, the effect of data leakage and overfitting on the performance parameters goes to zero.
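As a minimal, hypothetical illustration of the first kind of leakage (our own sketch in Python with numpy and scikit-learn, not a calculation from this work), the snippet below adds near-duplicate cases to a dataset before a random train/test split and then counts how many test cases have a near-identical twin in the training set; the data and the distance threshold are arbitrary placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)

# Hypothetical spectra-like data: 100 objects, 50 variables.
X = rng.normal(size=(100, 50))
# Re-measure 30 of the objects with tiny noise, i.e., near-duplicate cases.
X = np.vstack([X, X[:30] + rng.normal(scale=1e-3, size=(30, 50))])

X_train, X_test = train_test_split(X, test_size=0.2, random_state=1)

# Distance from each test case to its nearest training case.
d_min = pairwise_distances(X_test, X_train).min(axis=1)
n_leaky = int((d_min < 0.1).sum())
print(f"{n_leaky} of {len(X_test)} test cases have a near-duplicate "
      "in the training set (leakage of cases)")
```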

The classical leakage of cases occurs between the training set and the test set. An optimal test set is never used in the training process or in decisions concerning model selection or hyperparameter optimization. The test set should be representative of the intended application of the model. If the dataset is large enough to be split into training and test sets and the latter is a good representative of the intended application field, the test set can be selected from the existing dataset before model building starts. If the intended applications will involve large variability that the starting dataset does not cover, an optimal test set (or test sets) should be obtained in new measurement campaigns. Thus, the independent test set can be obtained either by splitting the data before modeling starts or later from new measurements. The sampling may follow two recipes: simple statistical sampling, where there is no preference for particular predictor or response ranges during the selection, or designed sampling based on different theories of sampling. These possibilities are detailed, for example, in Ref. [7].

In the case of models having hyperparameters, the simplest train/test splitting is not enough. It is at least necessary to divide the training set into temporary training and validation sets [26]. In the simplest process, a model with given hyperparameters is parametrized on the temporary training set and assessed on the validation set. The selection among models with different hyperparameters is based on their performance on the validation set. The final model is usually reparametrized on the aggregated temporary training and validation sets. Data leakage between the temporary training and validation sets is caused mostly by this aggregation, and it inherently biases the model selection.
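A minimal sketch of the splitting scheme described above, assuming a PLS model whose only hyperparameter is the number of latent variables; the data, the hyperparameter grid, and the split ratios are placeholders rather than choices taken from this work.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.3, size=120)

# Outer split: the test set is put aside and never touched during tuning.
X_tr, X_test, y_tr, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Inner split: temporary training + validation set for hyperparameter selection.
X_tmp, X_val, y_tmp, y_val = train_test_split(X_tr, y_tr, test_size=0.25, random_state=2)

# Pick the number of latent variables on the validation set only.
val_rmse = {}
for n_lv in range(1, 11):
    pls = PLSRegression(n_components=n_lv).fit(X_tmp, y_tmp)
    val_rmse[n_lv] = mean_squared_error(y_val, pls.predict(X_val).ravel()) ** 0.5
best_n_lv = min(val_rmse, key=val_rmse.get)

# Refit on the aggregated temporary training + validation data
# (this aggregation is the step where leakage into model selection arises),
# then assess once on the untouched test set.
final = PLSRegression(n_components=best_n_lv).fit(X_tr, y_tr)
test_rmse = mean_squared_error(y_test, final.predict(X_test).ravel()) ** 0.5
print(best_n_lv, round(test_rmse, 3))
```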

There may be further leakage in the hyperparameter optimization if validation parameters are used there. If a given validation parameter is used in the selection of hyperparameters, that parameter becomes overoptimistic for the final model in comparison with the other validation parameters. We might call this effect parameter leakage. Parameter leakage may be present in variable selection, too.
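The following sketch (our own illustration, not a calculation from the paper) shows the effect in its simplest form: the cross-validated R² that was used to select the hyperparameter is typically more optimistic than an R² estimated in an outer loop that took no part in the selection.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 40))
y = X[:, 0] + rng.normal(scale=1.0, size=60)

# Inner CV selects the number of latent variables by maximizing cross-validated R2.
search = GridSearchCV(
    PLSRegression(),
    {"n_components": list(range(1, 11))},
    scoring="r2",
    cv=KFold(5, shuffle=True, random_state=1),
)
search.fit(X, y)

# The score used for selection is overoptimistic ("parameter leakage") ...
print("R2 used in selection:", round(search.best_score_, 3))
# ... compared with an outer CV in which the selection is repeated inside each fold.
outer = cross_val_score(search, X, y, scoring="r2",
                        cv=KFold(5, shuffle=True, random_state=2))
print("nested-CV R2:", round(outer.mean(), 3))
```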

The OECD QSAR Guidance [8] categorizes validation processes as internal or external. Internal validation means calculating validation parameters on the performance of the model using data that were used in model building and model selection. External validation means calculating validation parameters on a test set (on an optimal test set, as defined above). The sole purpose of external validation should be to assess the predictivity of the final model. The aims of internal validation are to assess the goodness of fit of the model on the training set and the robustness of the model; the latter is mostly handled by cross-validation methods or sometimes by the bootstrap. The OECD guidance does not go into detail about how hyperparameter optimization should be performed. The use of cross-validation does not remove the necessity of external (test-set) validation. Cross-validation has its role in hyperparameter optimization, especially where it is not possible to set aside a separate validation set for this task. Furthermore, in the case of small datasets, cross-validation is also an approximate tool for estimating predictivity when an independent test set is impossible to obtain. Even so, we have to keep in mind the verdict of Ref. [7] that “cross-validation is only a suboptimal simulation of test set validation.”

The OECD guidance is seldom followed in all its parts, especially in the requirement of an independent test set. In contrast, surprisingly much emphasis has been put on how to perform cross-validation in the chemometric literature of the last three decades [9-12, 25]. Different tasks are listed there, such as model and hyperparameter selection and variable selection, and cross-validation is often used to obtain a realistic estimate of the “predictive” power of the model. Contrary to the clear trend in data science, there is a debate in chemometrics over whether predictive power can be determined using cross-validation methods alone [7, 13-19]. There are several cross-validation procedures, of which the repeated and double cross-validation methods provide stable validation parameters despite the data leakage involved [12, 20, 21]. In double cross-validation schemes (sometimes called nested ones), one of the iterations usually splits the data into a “test” set and a validation + temporary training set, but on closer inspection, one finds that these “test” sets do not fulfill the leakage-free requirement mentioned earlier. Some developers justify the lack of real external test sets with the idea that “one test set is not a test set,” referring to the high variance of validation parameters calculated on a single set [23, 24]. Nevertheless, the optimal solution is to have one large test set representing all the variability of the intended applications. If that is not possible, a good solution is to have several independent test sets, for example, obtained in different measurement campaigns, in order to sample the variability of later applications.

There are three main methods of nonnested cross-validation. Their names are used inconsistently, which causes misunderstandings about their power. In our previous studies, we followed the naming convention of the OECD Guidance [8].

We call the following parameter calculation process leave-one-out cross-validation (LOO-CV): if n_train cases are used to optimize the basic model parameters, models are developed on n_train − 1 observations. Altogether, n_train models are calculated, so that every case is omitted exactly once. The validation parameters are calculated on all n_train cases, but for each case only the predicted value from the model in which that case was not used in training is taken.

We call LMO-CV the calculation process (OECD) in which a similar approach is applied as in LOO-CV, but m cases are omitted at a time. Altogether, n_train/m models are built so that each case is omitted exactly once. The validation parameter is calculated similarly on all n_train cases, but for each case only the predicted value from the model in which that case was not used in training is taken. This method is sometimes called m-fold cross-validation.
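Both procedures defined above can be written compactly with scikit-learn; the sketch below is our own minimal illustration with MLR on placeholder data, and it computes the validation parameters from the out-of-fold predictions exactly as described, that is, every case is predicted by the model that did not use it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_predict
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                      # n_train = 50 cases
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=50)

model = LinearRegression()                        # MLR

# LOO-CV: n_train temporary models, each case left out exactly once.
y_loo = cross_val_predict(model, X, y, cv=LeaveOneOut())

# LMO-CV: n_train/m models, m cases left out at a time (here m = 10, i.e., 5 folds).
y_lmo = cross_val_predict(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=1))

for name, y_cv in [("LOO-CV", y_loo), ("LMO-CV", y_lmo)]:
    rmse_cv = np.sqrt(np.mean((y - y_cv) ** 2))
    q2 = r2_score(y, y_cv)                        # R2 computed on the CV predictions
    print(f"{name}: RMSE_CV = {rmse_cv:.3f}, Q2 = {q2:.3f}")
```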

The third nonnested cross-validation method splits the training set into a validation set of n_v elements and a distinct set of n_c = n_train − n_v cases on which the temporary model is trained. The validation parameter is calculated on the n_v set. Usually, the splitting into n_v and n_c is repeated several times, and the validation parameters are averaged over the repetitions. We call this procedure repeated cross-validation (REP-CV). In the literature, several authors call it leave-multiple-out cross-validation or LMO-CV. In this study, we show results for these three cases, denoting them LOO-CV, LMO-CV, and REP-CV.
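REP-CV differs from the previous two schemes in that the validation parameter is computed on each held-out block of n_v cases separately and then averaged over the repetitions; a minimal sketch on the same kind of placeholder data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=50)

n_v, n_rep = 10, 100                  # validation-set size and number of repetitions
splitter = ShuffleSplit(n_splits=n_rep, test_size=n_v, random_state=1)

rmse_list, r2_list = [], []
for fit_idx, val_idx in splitter.split(X):
    # Temporary model trained on the n_c = n_train - n_v cases.
    model = LinearRegression().fit(X[fit_idx], y[fit_idx])
    y_hat = model.predict(X[val_idx])
    rmse_list.append(np.sqrt(np.mean((y[val_idx] - y_hat) ** 2)))
    r2_list.append(r2_score(y[val_idx], y_hat))

# REP-CV parameters: averages over the repetitions.
print(f"RMSE_REP = {np.mean(rmse_list):.3f}, R2_REP = {np.mean(r2_list):.3f}")
```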

Note that some authors call LMO-CV m-fold cross-validation. REP-CV is referred to as hold-out [7], leave-n_v-out [25], leave-multiple-out [12], leave-p-out (e.g., on Wikipedia), and leave-M-out [7] cross-validation, depending on the number of split repetitions, which ranges from one up to the maximal number of combinatorial possibilities. There is one further version, not used by us, called Monte Carlo cross-validation [7], in which several splits are applied with nearly random n_v values and repetition numbers.

We used three modeling methods in our study: MLR, partial least squares regression (PLS), and artificial neural networks (ANN). Unless detailed otherwise in the text, the hyperparameters of the PLS and ANN models were identical to those in our previous studies [5, 6, 26]. The ANN models contained an input layer, a single hidden layer, and an output layer. The optimization was mostly performed with the Adam algorithm [27], and logistic or hyperbolic tangent activation functions were used.
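For orientation, the sketch below shows how the three model types with the stated architecture (MLR; PLS; an ANN with a single hidden layer, Adam optimizer, and logistic or tanh activation) can be instantiated in scikit-learn; the concrete hyperparameter values are placeholders and not those of the cited studies [5, 6, 26].

```python
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.neural_network import MLPRegressor

mlr = LinearRegression()                         # MLR
pls = PLSRegression(n_components=5)              # PLS; number of latent variables is a placeholder
ann = MLPRegressor(hidden_layer_sizes=(10,),     # single hidden layer (size is a placeholder)
                   activation="logistic",        # or "tanh"
                   solver="adam",                # Adam optimizer
                   max_iter=5000, random_state=0)
```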

We used datasets collected from the literature, most of which we had used previously. The details of the datasets can be found in Table 1. The aim of this article is to show selected cases of data/parameter leakage and some calculations related to our LOO-CV/LMO-CV scaling law. We were interested in demonstrating the existence of the effects, not in quantifying them in general; therefore, we did not perform massive serial calculations on several datasets. We could have demonstrated the effects on simulated data; in that case, one can create datasets free of any peculiarities other than the feature required for the demonstration of, for example, data leakage. On the other hand, trends based on simulated data are persistently suspected of having been simply built in during the data simulation process. To avoid this, we concentrate here on one dataset selected from the literature for each topic discussed. We do not detail them, but for each case, short calculations were also performed on other datasets, and the effects were detected there as well.

In most cases, we show the sample size dependence of the targeted behavior. Random samples were taken from the original datasets, and the sampling was repeated 100–500 times for each sample size. Random train/test splitting was applied with an 80/20 ratio, resulting in n_train and n_test cases in the respective sets. In these basic calculations, both the sampling and the splitting can be classified as statistical sampling, without consciously using the theory of sampling or design of experiments, for example, to select representative subsets for the models. The different performance parameters were calculated on all the samples, and their median values are mostly shown in the figures. We have successfully applied this scheme of plotting the medians of repeated model building against the training set size previously, in order to reveal clear trends in validation. We consider this method more convincing and reliable than showing only a few values corresponding to one given sample size. In this article, we did not focus on the allocation of the predictor values as detailed, for example, in the design of experiments; the one exception is Section 3.2.
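A sketch of this repeated-sampling scheme, assuming a generic dataset (X, y) given as numpy arrays and an MLR model; the helper function, the sample sizes, and the number of repetitions are illustrative placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def median_test_rmse(X, y, sample_sizes, n_rep=100, seed=0):
    """Median test RMSE of repeated random sampling followed by 80/20 splitting."""
    rng = np.random.default_rng(seed)
    medians = {}
    for n in sample_sizes:
        rmse = []
        for _ in range(n_rep):
            idx = rng.choice(len(y), size=n, replace=False)        # random sample of size n
            X_tr, X_te, y_tr, y_te = train_test_split(
                X[idx], y[idx], test_size=0.2,
                random_state=int(rng.integers(1_000_000)))
            model = LinearRegression().fit(X_tr, y_tr)
            rmse.append(np.sqrt(np.mean((y_te - model.predict(X_te)) ** 2)))
        medians[n] = np.median(rmse)                               # median over the repetitions
    return medians

# Example call with hypothetical sample sizes:
# medians = median_test_rmse(X, y, sample_sizes=[40, 80, 160])
```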

The calculations were performed with R and Python code using, for example, the reticulate and scikit-learn packages [35-37].

Two groups of validation parameters were calculated: the extensive root mean square error (RMSE) group and the intensive coefficient of determination (R²) group. The parameters are defined in Table 2.
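Table 2 is not reproduced here; for orientation, the standard forms of the two parameter families are sketched below, assuming the usual conventions (the table may define further variants, such as cross-validated or test-set versions).

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error (extensive: carries the units of y)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def r2(y, y_hat):
    """Coefficient of determination (intensive: dimensionless)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)
```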

Our investigations focused on aspects of data evaluation that have emerged in recent years. We provide example calculations on two topics: data leakage and the scaling of leave-one-out and LMO-CV parameters.

We show in our calculations that it is easy to introduce data leakage by inappropriate train/test splitting. Our leakage-free reference was random splitting. We showed that the Kennard–Stone method introduces significant data leakage between test and training sets. In the case of stratified sampling, we detected small changes compared to random sampling, but these changes did not significantly improve or worsen the models and did not falsify the validation. It may be a good choice if one intends to apply some theory of sampling beyond simple random statistical sampling. We showed that, in the case of repeated measurements, one should take care to put all repetitions of the same object either into the test set or into the training set.
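For the last point, a minimal sketch of a leakage-free split for repeated measurements, using a group-aware splitter so that all repetitions of one object end up on the same side of the split; the object labels and data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_objects, n_rep, n_vars = 30, 3, 10
# Three repeated measurements of each object; 'groups' labels the object.
groups = np.repeat(np.arange(n_objects), n_rep)
X = rng.normal(size=(n_objects * n_rep, n_vars))
y = rng.normal(size=n_objects * n_rep)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No object contributes cases to both sets, so repetitions cannot leak across the split.
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```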

We show examples of a kind of leakage that we call parameter leakage. In the first situation, the hyperparameters of a model were optimized with respect to a given cross-validation parameter; in the second, variable selection was optimized with respect to a given cross-validation parameter. Thereafter, these validation parameters became overoptimistic compared with the other validation parameters. Here, we warn against putting too much emphasis, during the final assessment of the model, on the parameters used in hyperparameter or variable selection. This holds not only for the given parameter but for the whole family of parameters that correlate with it.

We extended our scaling law between leave-one-out and leave-many-out cross-validation parameters to repeated cross-validation parameters. We show on sample-size-dependence graphs that all these different cross-validation parameters can be scaled onto each other by plotting the validation parameters against the number of cases used in the temporary model fits of the cross-validation. We found previously that the fluctuations of the leave-one-out and leave-many-out parameters are the same in this graphical representation. In the case of REP-CV, we found slightly larger fluctuations when the number of cross-validation blocks in the repetition equals the number of folds in the leave-many-out process. In conclusion, we repeat our previous recommendation to always use the computationally cheaper of LOO and LMO.
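A sketch of how such a scaled representation can be produced: each cross-validation variant is plotted against the number of cases used in its temporary model fits (n_train − 1 for LOO-CV, n_train − n_train/m for m-fold LMO-CV, n_c for REP-CV). The data and the model below are placeholders, so the plot only illustrates the representation, not the results of this work.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, KFold, cross_val_predict

rng = np.random.default_rng(0)
points = {"LOO-CV": [], "LMO-CV": []}
for n_train in (30, 60, 120, 240):
    X = rng.normal(size=(n_train, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=n_train)
    for name, cv, n_fit in [
        ("LOO-CV", LeaveOneOut(), n_train - 1),
        ("LMO-CV", KFold(5, shuffle=True, random_state=1), n_train - n_train // 5),
    ]:
        y_cv = cross_val_predict(LinearRegression(), X, y, cv=cv)
        points[name].append((n_fit, np.sqrt(np.mean((y - y_cv) ** 2))))

# Both variants are plotted against the size of their temporary fits.
for name, pts in points.items():
    x, rmse_cv = zip(*pts)
    plt.plot(x, rmse_cv, "o-", label=name)
plt.xlabel("cases used in the temporary model fits")
plt.ylabel("RMSE_CV")
plt.legend()
plt.show()
```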

There are several arguments concerning the preference for leave-one-out or LMO-CV. One of the most cited articles is Shao's from 1993 [25]. We discuss some misunderstandings in the interpretation of his results and argue that his theoretical results, derived in particular data-size limits, have limited relevance at practical data sizes. We show that his model calculations comparing leave-many-out and REP-CV parameters do not imply superiority over the leave-one-out method if m-fold cross-validation or a limited number of repetitions in REP-CV is applied.

The main goal of our investigation is to raise awareness among chemometricians of how easy it is to introduce data or parameter leakage through inappropriate methods, and to show that care is needed in interpreting opinions found in the literature. Furthermore, we show that leave-one-out cross-validation may be preferable to leave-many-out cross-validation in some applications; this can be explained by the fluctuation intervals shown in our scaled LOO-LMO graphs.
