Performance Evaluation of Machine Learning Models on Apache Spark: An Empirical Study

2022 14th International Conference on Computational Intelligence and Communication Networks (CICN). Pub Date: 2022-12-04. DOI: 10.1109/CICN56167.2022.10008282

Asma Z. Yamani, Shikah J. Alsunaidi, Imane Boudellioua


Citations: 0

Abstract

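Artificial intelligence (AI) and machine learning are significantly improving many sectors, such as education, healthcare, and industry. Machine learning techniques depend mainly on the volume and diversity of training data. With the ongoing digital transformation, abundant data can be collected from many sources. The problem that remains is how such volumes of data can be processed and where they can be stored. Cloud services and distributed file systems (DFSs) help address this issue. DFSs such as Hadoop, Quantcast, and Apache Spark differ in many aspects, including scheduling algorithms, data management protocols, throughput, and runtime, and some may suit specific applications better than others. Apache Spark can handle iterative workloads such as machine learning, and it provides MLlib, an integrated library of machine learning algorithms. In this paper, we evaluate Spark using two machine learning algorithms, Logistic Regression (LR) and Random Forests (RF), and investigate the effect of varying the memory allocation configuration and of using a GPU. We conclude that Spark greatly improves runtime and memory consumption. However, its use has to be justified by the size of the data, because several factors affect the machine learning model's accuracy. Memory allocation should be kept to the minimum needed, and a GPU should be used only when the chosen machine learning algorithm supports parallelization.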
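The kind of experiment the abstract describes, fitting LR and RF from MLlib under different memory allocations and comparing training time, might be sketched as follows. This is only an illustrative outline, not the authors' setup: the function names `time_call` and `benchmark`, the LIBSVM input path, the hyperparameters, and the memory values are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, List, Tuple


def time_call(fn: Callable[[], object]) -> Tuple[object, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def benchmark(train_path: str, executor_memories: List[str]) -> List[Dict[str, object]]:
    """Fit LR and RF once per memory setting and record the fit time.

    Assumption: train_path is a LIBSVM file whose labels are 0/1
    (a hypothetical input, not the dataset used in the paper).
    """
    # Imported lazily so time_call stays usable without a Spark install.
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression, RandomForestClassifier

    rows: List[Dict[str, object]] = []
    for mem in executor_memories:
        # spark.executor.memory takes effect on a cluster; in local mode the
        # work runs inside the driver JVM, whose memory is fixed at launch.
        spark = (
            SparkSession.builder
            .appName(f"mllib-bench-{mem}")
            .config("spark.executor.memory", mem)
            .getOrCreate()
        )
        data = spark.read.format("libsvm").load(train_path).cache()
        data.count()  # materialise the cache so loading is not billed to fit()

        estimators = [
            ("LR", LogisticRegression(maxIter=20)),       # illustrative settings
            ("RF", RandomForestClassifier(numTrees=50)),  # illustrative settings
        ]
        for name, estimator in estimators:
            _, seconds = time_call(lambda: estimator.fit(data))
            rows.append({"memory": mem, "model": name, "fit_seconds": seconds})

        spark.stop()  # start a fresh session for the next memory setting
    return rows
```

With a working Spark installation, a call such as `benchmark("train.libsvm", ["1g", "2g", "4g"])` would return one row per (memory, model) pair. A single timed fit is noisy, so a real comparison would repeat each run and would also record accuracy, which the abstract notes can be affected by the move to Spark.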


