Map-Reduce based Distance Weighted k-Nearest Neighbor Machine Learning Algorithm for Big Data Applications

Impact factor: 0.9; JCR quartile: Q4; category: COMPUTER SCIENCE, SOFTWARE ENGINEERING

Scalable Computing-Practice and Experience; Pub Date: 2022-12-22; DOI: 10.12694/scpe.v23i4.1987

E. Gothai, V. Muthukumaran, K. Valarmathi, Sathishkumar V E, N. Thillaiarasu, P. Karthikeyan

Citations: 1

Abstract

With the evolution of Internet standards and advances in Internet and mobile technologies, especially since Web 4.0, more and more web and mobile applications have emerged, such as e-commerce, social networks, online gaming, and Internet of Things based applications. Because these applications are deployed and accessed concurrently on the Internet and on mobile devices, the volume and variety of generated data are growing exponentially, ushering in the era of Big Data. Currently available data structures and data analysis algorithms cannot handle such Big Data. Hence, there is a need for scalable, flexible, parallel, and intelligent data analysis algorithms to handle and analyze complex, massive data. In this article, we propose a novel distributed supervised machine learning algorithm, called MR-DWkNN, based on the MapReduce programming model and the Distance Weighted k-Nearest Neighbor algorithm, to process and analyze Big Data in a Hadoop cluster environment. The proposed distributed algorithm performs both regression and classification tasks on large-volume Big Data applications. Three performance metrics are used to evaluate the proposed MR-DWkNN algorithm: Root Mean Squared Error (RMSE) and the coefficient of determination (R²) for regression tasks, and accuracy for classification tasks. Extensive experimental results show an average increase of 3% to 4.5% in prediction and classification performance compared with the standard distributed k-NN algorithm, and a considerable decrease in RMSE, together with good scalability and speedup, demonstrating the algorithm's effectiveness in Big Data prediction and classification applications.
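The abstract does not spell out the MR-DWkNN procedure, so the following is only a minimal in-process sketch of how a distance-weighted k-NN query might be split into a map step and a reduce step. It is not the authors' implementation and does not use Hadoop. The function names (`map_phase`, `reduce_phase`, `classify`, `regress`), the Euclidean distance, the inverse-distance weighting, and the toy data are all assumptions made for illustration.

```python
import heapq
import math
from collections import defaultdict


def map_phase(split, query, k):
    """Mapper (hypothetical): k nearest (distance, target) pairs within one data split."""
    candidates = ((math.dist(x, query), y) for x, y in split)
    return heapq.nsmallest(k, candidates, key=lambda pair: pair[0])


def reduce_phase(partial_results, k):
    """Reducer (hypothetical): merge per-split candidates, keep the global k nearest."""
    merged = [pair for part in partial_results for pair in part]
    return heapq.nsmallest(k, merged, key=lambda pair: pair[0])


def inverse_distance_weights(neighbors, eps=1e-9):
    """Assumed weighting: w = 1 / (d + eps); the paper may use a different scheme."""
    return [(1.0 / (d + eps), y) for d, y in neighbors]


def classify(neighbors):
    """Classification: the label with the largest summed weight wins."""
    votes = defaultdict(float)
    for w, label in inverse_distance_weights(neighbors):
        votes[label] += w
    return max(votes, key=votes.get)


def regress(neighbors):
    """Regression: weighted average of the neighbors' numeric targets."""
    weighted = inverse_distance_weights(neighbors)
    total = sum(w for w, _ in weighted)
    return sum(w * y for w, y in weighted) / total


# Toy data: each inner list stands in for one block of a distributed dataset.
splits = [
    [((1.0, 0.0), "a"), ((0.0, 1.2), "a")],
    [((0.1, 0.0), "b"), ((5.0, 5.0), "b")],
]
query = (0.0, 0.0)
k = 3

partials = [map_phase(split, query, k) for split in splits]  # one call per split
neighbors = reduce_phase(partials, k)
prediction = classify(neighbors)
print(prediction)  # prints: b
```

In this toy query, two of the three nearest neighbors carry label "a", so an unweighted majority vote would return "a". The single "b" point is much closer, and the inverse-distance weights let it dominate. This is the kind of difference that distance weighting is meant to capture.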

Source journal

Scalable Computing-Practice and Experience (COMPUTER SCIENCE, SOFTWARE ENGINEERING)

CiteScore

2.00

Self-citation rate

0.00%

About the journal: The area of scalable computing has matured and reached a point where new issues and trends require a professional forum. SCPE will provide this avenue by publishing original refereed papers that address the present as well as the future of parallel and distributed computing. The journal will focus on algorithm development, implementation and execution on real-world parallel architectures, and application of parallel and distributed computing to the solution of real-life problems.