Yujing Wang, Zhengguang Chen, Jinming Liu, He Wang
Journal of AOAC International. Published 2025-06-23. DOI: 10.1093/jaoacint/qsaf061 (https://doi.org/10.1093/jaoacint/qsaf061)
Outlier identification method based on multi-model weighted consensus in conjunction with Monte Carlo Cross-Validation.
Background: The accurate identification and removal of outliers are fundamental to the development of a robust model.
Objective: Relying on a single model for outlier identification may fail to identify all outliers accurately and can lead to false positives, false negatives, and model dependence.
Methods: This study introduces MCWC, an outlier identification method that combines Monte Carlo cross-validation with a weighted consensus of multiple models. The method integrates Monte Carlo random sampling with three distinct modeling methods: Partial Least Squares Regression (PLSR), Gaussian Process Regression (GPR), and Support Vector Regression (SVR). The predictions of the three models are then combined to identify outliers.
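The general idea can be illustrated with a minimal sketch. The abstract does not give the weighting formula, the outlier score, or the threshold, so everything below is an assumption made for illustration: the inverse-RMSE weights, the median-scaled error score, the threshold of 10, and all function names are hypothetical. To keep the sketch dependency-free, three simple single-feature stand-in models replace PLSR, GPR, and SVR; this is not the authors' implementation.

```python
import random
import statistics


def mean_model(train_x, train_y):
    """Stand-in 1 (deliberately weak): always predict the training mean."""
    mu = statistics.fmean(train_y)
    return lambda x: mu


def ols_model(train_x, train_y):
    """Stand-in 2: ordinary least-squares line on a single feature."""
    mx, my = statistics.fmean(train_x), statistics.fmean(train_y)
    sxx = sum((x - mx) ** 2 for x in train_x)
    sxy = sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def median_slope_model(train_x, train_y):
    """Stand-in 3: Theil-Sen style line (median of pairwise slopes)."""
    n = len(train_x)
    slopes = [(train_y[j] - train_y[i]) / (train_x[j] - train_x[i])
              for i in range(n) for j in range(i + 1, n)
              if train_x[j] != train_x[i]]
    slope = statistics.median(slopes)
    intercept = statistics.median([y - slope * x for x, y in zip(train_x, train_y)])
    return lambda x: intercept + slope * x


def mccv_errors(x, y, fit, n_runs=200, train_frac=0.8, seed=0):
    """Monte Carlo cross-validation: mean held-out |error| per sample, plus overall RMSE."""
    rng = random.Random(seed)  # same seed, so every model sees the same random splits
    n = len(x)
    n_train = int(train_frac * n)
    held_out = [[] for _ in range(n)]
    for _ in range(n_runs):
        idx = list(range(n))
        rng.shuffle(idx)
        train, test = idx[:n_train], idx[n_train:]
        predict = fit([x[i] for i in train], [y[i] for i in train])
        for i in test:
            held_out[i].append(abs(y[i] - predict(x[i])))
    per_sample = [statistics.fmean(e) if e else 0.0 for e in held_out]
    flat = [v for e in held_out for v in e]
    rmse = (sum(v * v for v in flat) / len(flat)) ** 0.5
    return per_sample, rmse


def weighted_consensus(x, y, fits, threshold=10.0):
    """Combine per-model scores with inverse-RMSE weights (an assumed weighting rule)."""
    results = [mccv_errors(x, y, fit) for fit in fits]
    inv = [1.0 / max(rmse, 1e-12) for _, rmse in results]
    weights = [v / sum(inv) for v in inv]
    consensus = [0.0] * len(x)
    for (per_sample, _), w in zip(results, weights):
        scale = statistics.median(per_sample) or 1e-12  # makes scores comparable across models
        for i, e in enumerate(per_sample):
            consensus[i] += w * e / scale
    flagged = [i for i, c in enumerate(consensus) if c > threshold]
    return flagged, consensus, weights


# Synthetic demo data (not the sorghum dataset): a noisy line with one shifted sample.
noise = random.Random(1)
xs = [i / 10 for i in range(40)]
ys = [2.0 * v + 1.0 + noise.gauss(0.0, 0.1) for v in xs]
ys[17] += 5.0  # inject a single gross outlier at index 17

flagged, consensus, weights = weighted_consensus(
    xs, ys, [mean_model, ols_model, median_slope_model])
print("flagged:", flagged)
print("weights:", [round(w, 3) for w in weights])
```

In this sketch the weak mean-only model receives the smallest weight, so its nearly uninformative scores contribute little to the consensus. With real spectra, the stand-ins could be replaced by thin wrappers around regressors such as scikit-learn's `PLSRegression`, `GaussianProcessRegressor`, and `SVR`, each exposing the same fit-then-predict shape.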
Results: A dataset of 305 sorghum samples served as the experimental basis. Predictive models for sorghum protein were built on the data after outlier removal by the single-model method and by the MCWC method, respectively. The results indicate that a dataset cleaned of outliers with a single modeling method is appropriate for further modeling with that same method but, because of model dependence, is not suitable for modeling with other methods. After outlier removal with the MCWC method, the average R² of the model prediction set was 0.8525. In contrast, after outlier removal with the Monte Carlo method combined exclusively with PLSR, the average R² of the model prediction set was 0.8037.
Conclusion: The MCWC method identifies outliers more accurately and addresses false positives, false negatives, and model dependence when identifying outliers in near-infrared spectra. This improves the overall predictive performance of the calibration model for quantitative spectral analysis.
Highlights: A multi-model dynamic weighted consensus method for outlier identification in near-infrared spectroscopy (NIRS) data is proposed. The dynamic weighting addresses the biases that can occur with simple averaging. Data cleaned of outliers by consensus methods are suitable for modeling with a wider range of models.
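The contrast between simple averaging and weighting can be shown with a small worked example. The numbers, the inverse-error weighting rule, and the threshold of 3 are all hypothetical and chosen only for illustration; the abstract does not specify how the dynamic weights are computed.

```python
def combine(scores, errors=None):
    """Combine per-model outlier scores for one sample.

    With errors=None every model counts equally (simple averaging).
    Otherwise each model is weighted by the inverse of its validation
    error, which is an assumed weighting rule.
    """
    if errors is None:
        weights = [1.0 / len(scores)] * len(scores)
    else:
        inv = [1.0 / e for e in errors]
        weights = [v / sum(inv) for v in inv]
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical scores for one sample: two well-fitting models rate it as
# anomalous, while a poorly fitting third model does not.
scores = [4.0, 4.0, 0.5]
errors = [0.5, 0.5, 2.0]  # hypothetical validation errors of the three models
threshold = 3.0           # hypothetical cut-off

simple = combine(scores)            # (4.0 + 4.0 + 0.5) / 3
weighted = combine(scores, errors)  # weights 4/9, 4/9, 1/9

print(f"simple average:   {simple:.4f} -> flagged: {simple > threshold}")
print(f"weighted average: {weighted:.4f} -> flagged: {weighted > threshold}")
```

Under these assumed numbers the simple average (about 2.83) falls below the threshold because the unreliable model pulls it down, whereas the weighted consensus (about 3.61) still flags the sample.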