ML-driven olive oil quality prediction: Comparative evaluation of FTMIR data preprocessing techniques using RF and XGBoost models in multi-stage validation
Author: Lahcen Hssaini
Journal: Measurement: Food, volume 19, Article 100249
Published: 2025-09-01
DOI: 10.1016/j.meafoo.2025.100249
URL: https://www.sciencedirect.com/science/article/pii/S277227592500036X
This study evaluates the impact of four mid-FTIR spectral preprocessing strategies (baseline correction, normalization, smoothing, and first-derivative transformation), each compared against raw data, on the performance of Random Forest (RF) and XGBoost (XGB) models for predicting key olive oil quality parameters, mainly total phenolic content (TPC), total flavonoid content (TFC), DPPH radical scavenging activity, and carotenoid levels, in the Picholine Marocaine cultivar. Using a dataset of 324 olive oil samples, models were trained and validated via a multi-stage framework (5-fold cross-validation and 20 % external validation). Results revealed that smoothing significantly enhanced TPC prediction (XGB R² = 0.96, RMSE = 24.5 mg GAE/kg), while first-derivative transformation gave the best TFC prediction (R² = 0.93, RMSE = 18.2 mg QE/kg). Raw data sufficed for carotenoids (R² > 0.89). XGBoost consistently outperformed RF by 7–15 % across parameters due to its superior regularization capabilities. Notably, blind testing exposed a 25 % drop in R² for DPPH with RF, underscoring the necessity of external validation. These findings support the development of rapid, non-destructive quality assessment tools with applications in industrial quality control, authentication systems, and regulatory compliance. Future research should explore hybrid preprocessing combinations, deep chemometric feature extraction, multi-cultivar validation, and seasonal model transferability to enhance robustness and commercial viability.
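To make the four preprocessing strategies and the multi-stage validation concrete, here is a minimal, standard-library Python sketch. It is not the study's code, and the abstract does not specify the exact algorithms, so every choice below is an assumption made for illustration: a two-point linear baseline, standard normal variate (SNV) as the normalization, a moving average standing in for the smoothing filter, a simple finite difference for the first derivative, and a split that holds out 20 % of the samples first and runs 5-fold cross-validation on the rest. All function names are hypothetical, and the RF and XGBoost models themselves are left out.

```python
import math
import random
from statistics import mean, pstdev


def baseline_correct(spectrum):
    """Subtract the straight line joining the first and last points (assumed method)."""
    n = len(spectrum)
    slope = (spectrum[-1] - spectrum[0]) / (n - 1)
    return [y - (spectrum[0] + slope * i) for i, y in enumerate(spectrum)]


def snv_normalize(spectrum):
    """Standard normal variate: zero mean and unit variance per spectrum (assumed method)."""
    mu, sigma = mean(spectrum), pstdev(spectrum)
    return [(y - mu) / sigma for y in spectrum]


def moving_average(spectrum, window=5):
    """Centered moving average; the window shrinks at the edges (stand-in for the smoothing step)."""
    half = window // 2
    smoothed = []
    for i in range(len(spectrum)):
        chunk = spectrum[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def first_derivative(spectrum):
    """Forward finite difference; the result is one point shorter than the input."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]


def multi_stage_split(n_samples, holdout_frac=0.20, k=5, seed=0):
    """Hold out a blind test set, then split the remaining indices into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_samples * holdout_frac)
    holdout, dev = indices[:n_holdout], indices[n_holdout:]
    folds = [dev[i::k] for i in range(k)]  # each fold serves once as the validation set
    return dev, folds, holdout


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target variable."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))
```

Under these assumptions, `multi_stage_split(324)` sets aside 65 samples for blind testing and leaves 259 for cross-validation, in folds of 52, 52, 52, 52, and 51. Comparing `r2_score` on the cross-validation folds with `r2_score` on the held-out set is the kind of check that would surface a gap like the one reported for DPPH with RF.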