Outlier identification method based on multi-model weighted consensus in conjunction with Monte Carlo Cross-Validation.

IF 1.7

Journal of AOAC International, Pub Date: 2025-06-23, DOI: 10.1093/jaoacint/qsaf061

Yujing Wang, Zhengguang Chen, Jinming Liu, He Wang


Citations: 0

Abstract


Outlier identification method based on multi-model weighted consensus in conjunction with Monte Carlo Cross-Validation.

Background: The accurate identification and removal of outliers are fundamental to the development of a robust model.

Objective: Relying on a single model for outlier identification may fail to identify all outliers accurately and can lead to false positives, false negatives, and model dependence.

Methods: This study introduces MCWC, an outlier identification method that combines Monte Carlo cross-validation with a weighted consensus of multiple models. The method integrates Monte Carlo random sampling with three distinct modeling methods: Partial Least Squares Regression (PLSR), Gaussian Process Regression (GPR), and Support Vector Regression (SVR). The predictions of the individual models are then combined to identify outliers.
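The abstract does not give the scoring or weighting formulas, so the following is only a minimal sketch of the general idea: repeated random train/validation splits, per-sample validation errors for each model, and a weighted combination of the standardized errors. Everything in it is an assumption made for illustration: the two stand-in models (`fit_mean`, `fit_line`) take the place of PLSR, GPR, and SVR, the inverse-median-error weights are a guess at what a dynamic weighting could look like, and the threshold of 3.0 is arbitrary.

```python
import random
import statistics

def fit_mean(xs, ys):
    """Stand-in model (hypothetical): always predicts the training mean."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Stand-in model (hypothetical): least-squares line on one feature."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: mean_y + slope * (x - mean_x)

def mc_mean_errors(xs, ys, fit, n_runs, train_frac, rng):
    """Mean absolute validation error of each sample over random splits."""
    n = len(xs)
    n_train = int(train_frac * n)
    errors = [[] for _ in range(n)]
    for _ in range(n_runs):
        idx = list(range(n))
        rng.shuffle(idx)
        train, valid = idx[:n_train], idx[n_train:]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        for i in valid:
            errors[i].append(abs(ys[i] - predict(xs[i])))
    # A sample that never landed in a validation set gets error 0.0.
    return [statistics.fmean(e) if e else 0.0 for e in errors]

def consensus_scores(xs, ys, fitters, n_runs=300, train_frac=0.8, seed=0):
    """Weighted sum of per-model standardized errors, plus the weights used."""
    rng = random.Random(seed)
    per_model = [mc_mean_errors(xs, ys, f, n_runs, train_frac, rng)
                 for f in fitters]
    # Assumed weighting: inverse of each model's median per-sample error.
    raw = [1.0 / (statistics.median(errs) + 1e-12) for errs in per_model]
    weights = [r / sum(raw) for r in raw]
    scores = [0.0] * len(xs)
    for w, errs in zip(weights, per_model):
        mu, sd = statistics.fmean(errs), statistics.pstdev(errs)
        for i, e in enumerate(errs):
            scores[i] += w * ((e - mu) / sd if sd else 0.0)
    return scores, weights

# Toy data: a straight line with one corrupted response at index 7.
xs = [float(i) for i in range(30)]
ys = [2.0 * x + 1.0 for x in xs]
ys[7] += 40.0

scores, weights = consensus_scores(xs, ys, [fit_mean, fit_line])
flagged = [i for i, s in enumerate(scores) if s > 3.0]  # arbitrary threshold
print("weights:", weights)
print("flagged:", flagged)
```

In this sketch the model that fits the toy data better receives the larger weight, so its view of which samples are unusual dominates the consensus score. Real spectral data would need multivariate models in place of the stand-ins.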

Results: A dataset of 305 sorghum samples served as the experimental basis. Predictive models for sorghum protein were built on the data remaining after outlier removal by the single-model method and by the MCWC method, respectively. The results indicate that a dataset cleaned of outliers with a single modeling method is appropriate for further modeling with that same method but, because of model dependence, not for modeling with other methods. After outlier removal with the MCWC method, the average R² of the model prediction set was 0.8525. In contrast, the average R² of the model prediction set after outlier removal with the Monte Carlo method combined exclusively with PLSR was 0.8037.
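For reference, the coefficient of determination reported above can be computed as in the short sketch below. The helper names `r_squared` and `average_r2` and the numbers in the example are hypothetical. Reading "average" as an average across several models' prediction-set R² values is an assumption, since the abstract does not say what is averaged.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def average_r2(y_true, predictions_by_model):
    """Mean R² over several models' predictions for the same samples."""
    values = [r_squared(y_true, p) for p in predictions_by_model.values()]
    return sum(values) / len(values)

# Hypothetical prediction-set values, for illustration only.
y_true = [10.0, 11.0, 12.0, 13.0]
predictions = {
    "model_a": [10.0, 11.0, 12.0, 13.0],  # perfect predictions, R² = 1
    "model_b": [11.5, 11.5, 11.5, 11.5],  # predicts the mean, R² = 0
}
print(average_r2(y_true, predictions))
```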

Conclusion: The MCWC method identifies outliers more accurately and addresses false positives, false negatives, and model dependence when identifying near-infrared spectral outliers. This improves the overall predictive performance of calibration models for quantitative spectral analysis.

Highlights: A multi-model dynamic weighted consensus method for identifying outliers in near-infrared spectroscopy (NIRS) data is proposed. The dynamic weighting addresses the biases that can occur with simple averaging. Data cleaned of outliers by the consensus method are suitable for modeling with a wider range of models.
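A small numeric illustration shows why weighting can matter compared with simple averaging. The inverse-error weights, the scores, and the threshold are all hypothetical, since the abstract does not state the actual weighting rule. Here two well-fitting models rate a sample as strongly outlying while a poorly fitting third model does not.

```python
def consensus(scores, errors=None):
    """Combine per-model outlier scores for one sample.

    With errors=None this is a simple average; otherwise each model is
    weighted by the inverse of its error (an assumed weighting rule).
    """
    if errors is None:
        weights = [1.0 / len(scores)] * len(scores)
    else:
        inverse = [1.0 / e for e in errors]
        weights = [v / sum(inverse) for v in inverse]
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical per-model scores for one sample and per-model errors.
scores = [5.0, 4.0, 0.5]
errors = [1.0, 1.0, 8.0]   # the third model fits poorly
threshold = 4.0            # arbitrary cutoff for this illustration

simple = consensus(scores)
weighted = consensus(scores, errors)
print(simple > threshold, weighted > threshold)
```

With these numbers the simple average falls below the cutoff because the poorly fitting model pulls it down, while the weighted score stays above it.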


Source journal

Journal of AOAC International

Self-citation rate

0.00%

Publication count