Jože M. Rožanec, Gašper Petelin, João Costa, Gregor Cerar, Blaž Bertalanič, Marko Guček, Gregor Papa, Dunja Mladenić
Engineering Applications of Artificial Intelligence, Volume 149, Article 110339. Published 2025-03-17.
DOI: 10.1016/j.engappai.2025.110339
Source: https://www.sciencedirect.com/science/article/pii/S0952197625003392
Citations: 0
Abstract
Dealing with zero-inflated data: Achieving state-of-the-art with a two-fold machine learning approach
In many cases, a machine learning model must learn to correctly predict a few data points with particular values of interest within a broader dataset where many target values are zero. Zero-inflated data arise in diverse scenarios, such as lumpy and intermittent demand, power consumption of home appliances being turned on and off, impurity measurements in distillation processes, and even airport shuttle demand prediction. The presence of zeroes affects the models’ learning and may result in poor performance. Furthermore, zeroes distort the metrics used to compute the model’s prediction quality. This paper showcases two real-world use cases (home appliance classification and airport shuttle demand prediction) in which a hierarchical model applied to zero-inflated data leads to considerable performance improvements. In particular, for home appliance classification, the weighted averages of Precision, Recall, F1, and Area Under the Receiver Operating Characteristic Curve (AUC ROC) increased by 39%, 49%, 88%, and 48%, respectively. Furthermore, the proposed approach is estimated to be four times more energy efficient than the state-of-the-art (SOTA) approach against which it was compared. Two-fold modeling approaches significantly outperform regular regression, especially when predicting the occurrence of demand events. SOTA results were achieved using gradient boosting trees to determine whether an event will occur and Visual Geometry Group (VGG) or Support Vector Regressor (SVR) models for the subsequent classification/regression. The code has been released in two separate repositories.
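The two-fold idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code. The class names (`TwoFoldModel`, `ThresholdClassifier`, `MeanRegressor`), the helper functions, and the toy data are all hypothetical, and the two stand-in models are deliberately trivial so that the example runs with the Python standard library alone. Stage one decides whether an event occurs at all; stage two is trained only on the non-zero rows and estimates the magnitude. The two error functions also show how zeroes can flatter an aggregate metric.

```python
from statistics import mean


class TwoFoldModel:
    """Stage 1 decides whether the target is non-zero; stage 2 estimates its size."""

    def __init__(self, occurrence_model, magnitude_model):
        # Both models are assumed to expose fit(X, y) and predict(X).
        self.occurrence_model = occurrence_model
        self.magnitude_model = magnitude_model

    def fit(self, X, y):
        # Stage 1 learns from every row: did an event occur (1) or not (0)?
        occurred = [1 if target != 0 else 0 for target in y]
        self.occurrence_model.fit(X, occurred)
        # Stage 2 learns only from the rows where an event occurred.
        X_events = [x for x, flag in zip(X, occurred) if flag]
        y_events = [target for target, flag in zip(y, occurred) if flag]
        self.magnitude_model.fit(X_events, y_events)
        return self

    def predict(self, X):
        flags = self.occurrence_model.predict(X)
        sizes = self.magnitude_model.predict(X)
        # Predict exactly zero unless stage 1 expects an event.
        return [float(size) if flag else 0.0 for flag, size in zip(flags, sizes)]


class ThresholdClassifier:
    """Toy stand-in: flags an event when the first feature exceeds a learned cut-off."""

    def fit(self, X, labels):
        positives = [x[0] for x, label in zip(X, labels) if label]
        negatives = [x[0] for x, label in zip(X, labels) if not label]
        self.cutoff = (min(positives) + max(negatives)) / 2
        return self

    def predict(self, X):
        return [1 if x[0] > self.cutoff else 0 for x in X]


class MeanRegressor:
    """Toy stand-in: always predicts the mean of the targets it was trained on."""

    def fit(self, X, y):
        self.value = mean(y)
        return self

    def predict(self, X):
        return [self.value for _ in X]


def mae(actual, predicted):
    """Mean absolute error over all rows, zeroes included."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])


def mae_on_events(actual, predicted):
    """Mean absolute error restricted to rows whose true target is non-zero."""
    return mean([abs(a - p) for a, p in zip(actual, predicted) if a != 0])


# Hypothetical zero-inflated toy data: six of the eight targets are zero.
X = [[0.1], [0.2], [0.3], [0.4], [0.9], [0.8], [0.2], [0.1]]
y = [0, 0, 0, 0, 12, 8, 0, 0]

model = TwoFoldModel(ThresholdClassifier(), MeanRegressor()).fit(X, y)
two_fold = model.predict(X)
all_zero = [0.0] * len(y)  # baseline that never predicts an event

print("all-zero baseline:", mae(y, all_zero), mae_on_events(y, all_zero))
print("two-fold model:   ", mae(y, two_fold), mae_on_events(y, two_fold))
```

On this toy data the all-zero baseline has an overall error of 2.5 but an error of 10.0 on the rows that actually contain events, which is the kind of metric distortion the abstract warns about. Because `TwoFoldModel` only relies on `fit` and `predict`, the stand-ins could in principle be swapped for real estimators with that interface, such as a gradient boosting classifier for stage one and a support vector regressor for stage two; that substitution is an assumption here and is not shown.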
About the journal:
Artificial Intelligence (AI) is pivotal in driving the fourth industrial revolution and has seen remarkable advancements across various machine learning methodologies. AI techniques have become indispensable tools for practicing engineers, enabling them to tackle previously insurmountable challenges. Engineering Applications of Artificial Intelligence serves as a global platform for the swift dissemination of research on the practical application of AI methods across all engineering disciplines. Submitted papers are expected to present novel aspects of AI used in real-world engineering applications, validated on publicly available datasets to ensure the replicability of research outcomes. Join us in exploring the transformative potential of AI in engineering.