An in-depth analysis of data reduction methods for sustainable deep learning.

Open research Europe. Pub Date: 2024-09-18; eCollection Date: 2024-01-01; DOI: 10.12688/openreseurope.17554.2

Javier Perera-Lago, Victor Toscano-Duran, Eduardo Paluzo-Hidalgo, Rocio Gonzalez-Diaz, Miguel A Gutiérrez-Naranjo, Matteo Rucco

Open research Europe, vol. 4, p. 101 (Journal Article). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11413558/pdf/

Citations: 0

Abstract

In recent years, deep learning has gained popularity for its ability to solve complex classification tasks. It provides increasingly better results thanks to the development of more accurate models, the availability of huge volumes of data and the improved computational capabilities of modern computers. However, these improvements in performance also bring efficiency problems, related to the storage of datasets and models, and to the waste of energy and time involved in both the training and inference processes. In this context, data reduction can help reduce energy consumption when training a deep learning model. In this paper, we present up to eight different methods to reduce the size of a tabular training dataset, and we develop a Python package to apply them. We also introduce a representativeness metric based on topology to measure the similarity between the reduced datasets and the full training dataset. Additionally, we develop a methodology to apply these data reduction methods to image datasets for object detection tasks. Finally, we experimentally compare how these data reduction methods affect the representativeness of the reduced dataset, the energy consumption and the predictive performance of the model.
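To make the idea of reducing a tabular training dataset concrete, the following is a minimal sketch of class-stratified random subsampling. It is a generic baseline written with NumPy only; it does not use the authors' Python package, and the function name `stratified_subsample` and its parameters `fraction` and `seed` are hypothetical names chosen for illustration. The abstract does not specify which eight methods are studied, so this should not be read as one of them.

```python
import numpy as np


def stratified_subsample(X, y, fraction, seed=0):
    """Keep roughly `fraction` of the rows of (X, y), sampled class by class.

    Illustrative helper (hypothetical name), not part of the paper's package.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = np.random.default_rng(seed)
    kept = []
    for label in np.unique(y):
        # Row indices belonging to the current class.
        idx = np.flatnonzero(y == label)
        # Keep at least one example so that no class disappears.
        n_keep = max(1, int(round(fraction * idx.size)))
        kept.append(rng.choice(idx, size=n_keep, replace=False))
    kept = np.sort(np.concatenate(kept))
    return X[kept], y[kept]


# Toy tabular dataset: 1000 rows, 5 features, 4 imbalanced classes.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(1000, 5))
y = np.repeat([0, 1, 2, 3], [400, 300, 200, 100])

X_red, y_red = stratified_subsample(X, y, fraction=0.25)
```

Sampling within each class keeps the class proportions of the reduced dataset close to those of the full one, which is one simple way to preserve some similarity between the two. Whether such a subset is representative in the topological sense the paper proposes would have to be checked with the paper's own metric.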


Source journal

Open research Europe

CiteScore

1.50

Self-citation rate

0.00%

Publication count