Bayesian and partial least square global forage calibrations models developed by an iterative procedure using R

Proceedings of the 18th International Conference on Near Infrared Spectroscopy. Pub Date: 2019-02-19. DOI: 10.1255/NIR2017.051

A. Ferragina, F. Benozzo, P. Berzaghi


Citations: 3

Abstract

Author Summary: The aim of our study was to test an iterative validation process implemented in R and to assess the accuracy of the best selected equations, developed using two regression algorithms: partial least squares (PLS) and Bayesian regression. A data set (Set_a) of 3187 records from 6 different types of forage was used. Calibrations were tested for protein, neutral detergent fiber and acid detergent fiber. For each sample, a spectrum was collected using a FOSS NIRSystem (1100–2498 nm). A subset of 20 samples for each type of forage (Set_ext; 120 samples) was randomly selected for the final validation of the best selected equations. The remaining samples (Set_b = Set_a − Set_ext) were used for the iterative calibration process. In each iteration, Set_b was randomly divided into a testing set (Set_tst; 10 % of Set_b) and a training set (Set_trn = Set_b − Set_tst); 300 iterations were run. All computations were done in the R environment, using the packages “pls” for PLS, “BGLR” for the Bayesian models and “prospectr” for the spectral treatments. In each iteration we used three spectral treatments (raw, first derivative, and standard normal variate with detrend), two approaches for selecting the optimal number of PLS components, and the Bayesian model. Nine types of equations were therefore developed and tested in each iteration [(2 PLS techniques + 1 Bayesian) × 3 spectral treatments]. Across the 300 iterations, for each of the 9 equation types, the best equation (lowest RMSE) and the average of the best 25 % (RMSE below the first quartile) were selected and validated by forage type. R demonstrated its potential for chemometric work on large data sets and with complex statistical procedures. An R² higher than 0.9 was obtained for almost all calibrations. In the external validation, the Bayesian models outperformed the commonly used PLS in many cases, demonstrating that an alternative for improving prediction accuracy exists. The present work has demonstrated that iterative validation subsampling on big data can lead to the selection of proper equations, and that it can be done using R.
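The sampling and selection scheme described in the abstract can be sketched in a few lines. The study itself was carried out in R with the “pls”, “BGLR” and “prospectr” packages; the block below is only an illustrative Python sketch using the standard library. The record layout (`forage`, `x`, `y`), the synthetic data, and the one-variable least-squares `fit_line` are all assumptions standing in for real spectra and for the PLS and Bayesian models. "Average of the best 25 %" is read here as averaging the predictions of those models, which is one possible interpretation.

```python
import random
import statistics


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)) ** 0.5


def group_by_type(records):
    """Group records by their forage type."""
    groups = {}
    for rec in records:
        groups.setdefault(rec["forage"], []).append(rec)
    return groups


def split_external(records, per_type=20, seed=0):
    """Hold out `per_type` random samples per forage type (Set_ext); the rest is Set_b."""
    rng = random.Random(seed)
    set_ext, set_b = [], []
    for group in group_by_type(records).values():
        shuffled = rng.sample(group, len(group))
        set_ext.extend(shuffled[:per_type])
        set_b.extend(shuffled[per_type:])
    return set_ext, set_b


def fit_line(train):
    """Hypothetical stand-in for PLS/Bayesian regression: one-variable least squares."""
    xs = [r["x"] for r in train]
    ys = [r["y"] for r in train]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return lambda x: my + slope * (x - mx)


def iterate(set_b, fit, n_iter=300, test_frac=0.10, seed=0):
    """Repeat a random 90/10 split of Set_b; return one (rmse, model) pair per iteration."""
    rng = random.Random(seed)
    n_test = round(test_frac * len(set_b))
    results = []
    for _ in range(n_iter):
        shuffled = rng.sample(set_b, len(set_b))
        set_tst, set_trn = shuffled[:n_test], shuffled[n_test:]
        model = fit(set_trn)
        err = rmse([r["y"] for r in set_tst], [model(r["x"]) for r in set_tst])
        results.append((err, model))
    return results


def select(results):
    """Return the lowest-RMSE pair and all pairs with RMSE below the first quartile."""
    q1 = statistics.quantiles([err for err, _ in results], n=4)[0]
    best = min(results, key=lambda r: r[0])
    top_quarter = [r for r in results if r[0] < q1]
    return best, top_quarter


def validate_by_type(predict, set_ext):
    """External validation: RMSE on Set_ext, reported separately for each forage type."""
    return {
        forage: rmse([r["y"] for r in group], [predict(r["x"]) for r in group])
        for forage, group in group_by_type(set_ext).items()
    }


# Synthetic demo data (assumption): 6 forage types x 50 samples, y = 2x + noise.
data_rng = random.Random(42)
records = []
for t in range(6):
    for _ in range(50):
        x = data_rng.uniform(0, 10)
        records.append({"forage": f"type{t}", "x": x, "y": 2.0 * x + data_rng.gauss(0, 0.5)})

set_ext, set_b = split_external(records)
results = iterate(set_b, fit_line)
best, top_quarter = select(results)


def ensemble(x):
    """Average the predictions of the models in the best 25 %."""
    return statistics.fmean(model(x) for _, model in top_quarter)


print(validate_by_type(best[1], set_ext))
print(validate_by_type(ensemble, set_ext))
```

In the study this loop would be run once per equation type (2 PLS variants and 1 Bayesian model, each on 3 spectral treatments), giving nine sets of 300 results to select from. Keeping Set_ext out of the loop entirely is what makes the final per-forage RMSE an external check rather than a by-product of the selection.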

