Asma Z. Yamani, Shikah J. Alsunaidi, Imane Boudellioua
{"title":"基于Apache Spark的机器学习模型性能评估:实证研究","authors":"Asma Z. Yamani, Shikah J. Alsunaidi, Imane Boudellioua","doi":"10.1109/CICN56167.2022.10008282","DOIUrl":null,"url":null,"abstract":"Artificial intelligence (AI) and machine learning significantly improve many sectors, such as education, healthcare, and industry. Machine learning techniques mainly depend on the volume and diversity of training data. With the digital transformation we live in, an abundant amount of data can be collected from different sources. However, the problem that needs to be addressed is how this amount of data can be processed and where it can be stored. Cloud services and distributed file systems (DFSs) help address this issue. Many DFSs such as Hadoop, Quantcast, and Apache Spark differ in many aspects, including scheduling algorithms, data management protocol, throughput, and runtime. Some DFSs may be better for working with specific applications than others. Apache Spark is capable of handling iterative operations like machine learning operations as well as it provides an integrated library of different machine learning algorithms called MLlib. In this paper, we evaluated the use of Spark using two machine learning algorithms, namely Logistic Regression (LR) and Random Forests (RF). We investigated the effect of varying the memory allocation configuration and the use of GPU. We concluded that the use of Spark greatly improves the runtime and memory consumption. However, its use has to be justifiable and needed for the size of the data due to different factors that affect the machine learning model's accuracy. The memory allocation should be kept to the minimum needed, and GPU should only be used when the machine learning algorithm used supports parallelization.","PeriodicalId":287589,"journal":{"name":"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance Evaluation of Machine Learning Models on Apache Spark: An Empirical Study\",\"authors\":\"Asma Z. Yamani, Shikah J. Alsunaidi, Imane Boudellioua\",\"doi\":\"10.1109/CICN56167.2022.10008282\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Artificial intelligence (AI) and machine learning significantly improve many sectors, such as education, healthcare, and industry. Machine learning techniques mainly depend on the volume and diversity of training data. With the digital transformation we live in, an abundant amount of data can be collected from different sources. However, the problem that needs to be addressed is how this amount of data can be processed and where it can be stored. Cloud services and distributed file systems (DFSs) help address this issue. Many DFSs such as Hadoop, Quantcast, and Apache Spark differ in many aspects, including scheduling algorithms, data management protocol, throughput, and runtime. Some DFSs may be better for working with specific applications than others. Apache Spark is capable of handling iterative operations like machine learning operations as well as it provides an integrated library of different machine learning algorithms called MLlib. In this paper, we evaluated the use of Spark using two machine learning algorithms, namely Logistic Regression (LR) and Random Forests (RF). We investigated the effect of varying the memory allocation configuration and the use of GPU. We concluded that the use of Spark greatly improves the runtime and memory consumption. However, its use has to be justifiable and needed for the size of the data due to different factors that affect the machine learning model's accuracy. The memory allocation should be kept to the minimum needed, and GPU should only be used when the machine learning algorithm used supports parallelization.\",\"PeriodicalId\":287589,\"journal\":{\"name\":\"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)\",\"volume\":\"41 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-12-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CICN56167.2022.10008282\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 14th International Conference on Computational Intelligence and Communication Networks (CICN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CICN56167.2022.10008282","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Performance Evaluation of Machine Learning Models on Apache Spark: An Empirical Study
Artificial intelligence (AI) and machine learning significantly improve many sectors, such as education, healthcare, and industry. Machine learning techniques mainly depend on the volume and diversity of training data. With the digital transformation we live in, an abundant amount of data can be collected from different sources. However, the problem that needs to be addressed is how this amount of data can be processed and where it can be stored. Cloud services and distributed file systems (DFSs) help address this issue. Many DFSs such as Hadoop, Quantcast, and Apache Spark differ in many aspects, including scheduling algorithms, data management protocol, throughput, and runtime. Some DFSs may be better for working with specific applications than others. Apache Spark is capable of handling iterative operations like machine learning operations as well as it provides an integrated library of different machine learning algorithms called MLlib. In this paper, we evaluated the use of Spark using two machine learning algorithms, namely Logistic Regression (LR) and Random Forests (RF). We investigated the effect of varying the memory allocation configuration and the use of GPU. We concluded that the use of Spark greatly improves the runtime and memory consumption. However, its use has to be justifiable and needed for the size of the data due to different factors that affect the machine learning model's accuracy. The memory allocation should be kept to the minimum needed, and GPU should only be used when the machine learning algorithm used supports parallelization.