Investigating the Efficiency of Machine Learning Algorithms on MapReduce Clusters with SSDs

2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI) Pub Date : 2018-11-01 DOI:10.1109/ICTAI.2018.00157

Leonidas Akritidis, Athanasios Fevgas, P. Tsompanopoulou, Panayiotis Bozanis


Abstract

In the big data era, the efficient processing of large volumes of data has become a standard requirement for both organizations and enterprises. Since single workstations cannot sustain such tremendous workloads, MapReduce was introduced with the aim of providing a robust, easy, and fault-tolerant parallelization framework for the execution of applications on large clusters. Machine learning algorithms, which dominate the broad research area of data mining, are among the most representative examples of such applications. At the same time, recent advances in hardware technology have led to the introduction of high-performing alternative devices for secondary storage, known as Solid State Drives (SSDs). In this paper we examine the performance of several parallel data mining algorithms on MapReduce clusters equipped with such modern hardware. More specifically, we investigate standard dataset preprocessing methods, including vectorization and dimensionality reduction, and two supervised classifiers, Naive Bayes and Linear Regression. We compare the execution times of these algorithms on an experimental cluster equipped with both standard magnetic disks and SSDs, employing two different datasets and several different cluster configurations. Our experiments demonstrate that the use of SSDs can accelerate the execution of machine learning methods by a margin that depends on the cluster setup and the nature of the applied algorithms.
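To make the connection between MapReduce and one of the classifiers named above concrete, the following is a minimal single-process sketch of how Naive Bayes training can be phrased as a map step and a reduce step. It is not the authors' implementation and does not use any cluster framework: the function names (`map_phase`, `reduce_phase`, `train`, `predict`), the whitespace tokenizer, and the add-one smoothing are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map step (hypothetical): emit (key, 1) pairs for one labeled document."""
    label, text = record
    yield (("class", label), 1)              # one document seen for this class
    for token in text.lower().split():       # naive whitespace tokenizer (assumption)
        yield (("token", label, token), 1)   # one occurrence of token in this class


def reduce_phase(pairs):
    """Reduce step (hypothetical): sum the values of all pairs sharing a key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def train(records):
    """Run map over every record, then reduce; a cluster would shard both steps."""
    return reduce_phase(chain.from_iterable(map_phase(r) for r in records))


def predict(counts, text):
    """Pick the class with the highest log-posterior, using add-one smoothing."""
    labels = [k[1] for k in counts if k[0] == "class"]
    vocab = {k[2] for k in counts if k[0] == "token"}
    total_docs = sum(counts[("class", label)] for label in labels)
    best_label, best_score = None, -math.inf
    for label in labels:
        tokens_in_label = sum(
            v for k, v in counts.items() if k[0] == "token" and k[1] == label
        )
        score = math.log(counts[("class", label)] / total_docs)
        for token in text.lower().split():
            c = counts.get(("token", label, token), 0)
            score += math.log((c + 1) / (tokens_in_label + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

In this sketch the training data is only read once and the reduce step merely sums counters, which is why this kind of workload tends to be dominated by reading input and writing intermediate pairs; that is the sort of I/O-bound behavior for which the storage device could plausibly matter.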
