Architectures of big data analytics: scaling out data mining algorithms using Hadoop–MapReduce and Spark
Authors: Sheikh Kamaruddin, V. Ravi
Published in: Handbook of Big Data Analytics. Volume 1: Methodologies (2021-07-07)
DOI: https://doi.org/10.1049/pbpc037f_ch7
Many statistical and machine learning (ML) techniques have been applied successfully to small datasets over the past one and a half decades. Today, however, application domains such as healthcare, finance, bioinformatics, telecommunications, and meteorology generate huge volumes of data daily, and these massive datasets must be analyzed to discover hidden insights. With the advent of the big data analytics (BDA) paradigm, data mining (DM) techniques were modified and scaled out to suit distributed and parallel environments. This chapter reviews 249 articles published between 2009 and 2019 that implemented DM techniques in a parallel, distributed manner on the Apache Hadoop MapReduce framework or in the Apache Spark environment to solve various DM tasks. We present critical analyses of these papers and bring out some interesting insights. We find that methods such as Apriori, support vector machine (SVM), random forest (RF), K-means, and many of their variants, along with many other approaches, have been adapted to parallel distributed environments and have yielded scalable, effective insights. The review concludes with a discussion of open research areas and future directions that researchers and practitioners alike can explore.
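To make "scaling out" concrete, here is a minimal sketch of how one of the named methods, K-means, can be phrased as a map step and a reduce step. It is an illustration only and is not taken from the chapter or from any of the reviewed papers. It runs in plain Python with no Hadoop or Spark dependency, and the names `map_phase`, `reduce_phase`, and `kmeans`, the 2-D points, and the list-of-lists "partitions" are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Partial = Tuple[float, float, int]  # (sum_x, sum_y, count)


def map_phase(partition: List[Point], centroids: List[Point]) -> Dict[int, Partial]:
    """Map step (with local combining) for one data partition.

    Assigns every point to its nearest centroid and returns per-centroid
    partial sums, so only small summaries leave the partition.
    """
    partial: Dict[int, Partial] = defaultdict(lambda: (0.0, 0.0, 0))
    for x, y in partition:
        nearest = min(
            range(len(centroids)),
            key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
        )
        sx, sy, n = partial[nearest]
        partial[nearest] = (sx + x, sy + y, n + 1)
    return dict(partial)


def reduce_phase(partials: List[Dict[int, Partial]], centroids: List[Point]) -> List[Point]:
    """Reduce step: merge partial sums per centroid id and recompute means."""
    totals: Dict[int, Partial] = {}
    for partial in partials:
        for cid, (sx, sy, n) in partial.items():
            tx, ty, tn = totals.get(cid, (0.0, 0.0, 0))
            totals[cid] = (tx + sx, ty + sy, tn + n)
    # A centroid that received no points keeps its previous position.
    return [
        (totals[i][0] / totals[i][2], totals[i][1] / totals[i][2]) if i in totals else c
        for i, c in enumerate(centroids)
    ]


def kmeans(partitions: List[List[Point]], centroids: List[Point], iterations: int = 10) -> List[Point]:
    """Driver loop: each iteration is one map pass plus one reduce pass."""
    for _ in range(iterations):
        partials = [map_phase(p, centroids) for p in partitions]  # parallelizable
        centroids = reduce_phase(partials, centroids)
    return centroids


if __name__ == "__main__":
    # Two hypothetical partitions, each holding points from both clusters.
    data = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 1.0), (10.0, 11.0)]]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)], iterations=5))
```

In this sketch the map step touches only its own partition and the reduce step sees only per-centroid sums and counts, which is the property that lets the same iteration run across many machines. A real deployment would hand the per-partition work and the merge to the framework rather than to a Python list comprehension.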