大数据处理框架和架构:调查

Handbook of Big Data Analytics. Volume 1: Methodologies Pub Date : 2021-07-07 DOI:10.1049/pbpc037f_ch2

R. Chunduri, A. Cherukuri

{"title":"大数据处理框架和架构:调查","authors":"R. Chunduri, A. Cherukuri","doi":"10.1049/pbpc037f_ch2","DOIUrl":null,"url":null,"abstract":"In recent times, there has been rapid growth in data generated from autonomous sources. The existing data processing techniques are not suitable to deal with these large volumes of complex data that can be structured, semi-structured or unstructured. This large data is referred to as Big data because of its main characteristics: volume, variety velocity, value and veracity. Extensive research on Big data is ongoing, and the primary focus of this research is on processing massive amounts of data effectively and efficiently. However, researchers are paying little attention on how to store and analyze the large volumes of data to get useful insights from it. In this chapter, the authors examine existing Big data processing frameworks like MapReduce, Apache Spark, Storm and Flink. In this chapter, the architectures of MapReduce, iterative MapReduce frameworks and components of Apache Spark are discussed in detail. Most of the widely used classical machine learning techniques are implemented using these Big data frameworks in the form of Apache Mahout and Spark MLlib libraries and these need to be enhanced to support all existing machine learning techniques like formal concept analysis (FCA) and neural embedding. In this chapter, authors have taken FCA as an application and provided scalable FCA algorithms using the Big data processing frameworks like MapReduce and Spark. Streaming data processing frameworks like Apache Flink and Apache Storm is also examined. Authors also discuss about the storage architectures like Hadoop Distributed File System (HDFS), Dynamo and Amazon S3 in detail while processing large Big data applications. The survey concludes with a proposal for best practices related to the studied architectures and frameworks.","PeriodicalId":162132,"journal":{"name":"Handbook of Big Data Analytics. Volume 1: Methodologies","volume":"64 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Big data processing frameworks and architectures: a survey\",\"authors\":\"R. Chunduri, A. Cherukuri\",\"doi\":\"10.1049/pbpc037f_ch2\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In recent times, there has been rapid growth in data generated from autonomous sources. The existing data processing techniques are not suitable to deal with these large volumes of complex data that can be structured, semi-structured or unstructured. This large data is referred to as Big data because of its main characteristics: volume, variety velocity, value and veracity. Extensive research on Big data is ongoing, and the primary focus of this research is on processing massive amounts of data effectively and efficiently. However, researchers are paying little attention on how to store and analyze the large volumes of data to get useful insights from it. In this chapter, the authors examine existing Big data processing frameworks like MapReduce, Apache Spark, Storm and Flink. In this chapter, the architectures of MapReduce, iterative MapReduce frameworks and components of Apache Spark are discussed in detail. Most of the widely used classical machine learning techniques are implemented using these Big data frameworks in the form of Apache Mahout and Spark MLlib libraries and these need to be enhanced to support all existing machine learning techniques like formal concept analysis (FCA) and neural embedding. In this chapter, authors have taken FCA as an application and provided scalable FCA algorithms using the Big data processing frameworks like MapReduce and Spark. Streaming data processing frameworks like Apache Flink and Apache Storm is also examined. Authors also discuss about the storage architectures like Hadoop Distributed File System (HDFS), Dynamo and Amazon S3 in detail while processing large Big data applications. The survey concludes with a proposal for best practices related to the studied architectures and frameworks.\",\"PeriodicalId\":162132,\"journal\":{\"name\":\"Handbook of Big Data Analytics. Volume 1: Methodologies\",\"volume\":\"64 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-07-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Handbook of Big Data Analytics. Volume 1: Methodologies\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1049/pbpc037f_ch2\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Handbook of Big Data Analytics. Volume 1: Methodologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1049/pbpc037f_ch2","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

近年来，自主来源产生的数据增长迅速。现有的数据处理技术不适合处理这些大量的复杂数据，这些数据可以是结构化的、半结构化的或非结构化的。这种大数据之所以被称为大数据，是因为它的主要特征是:体积、变化速度、价值和准确性。关于大数据的广泛研究正在进行中，这些研究的主要重点是有效和高效地处理大量数据。然而，研究人员很少关注如何存储和分析大量数据以从中获得有用的见解。在本章中，作者研究了现有的大数据处理框架，如MapReduce、Apache Spark、Storm和Flink。本章详细讨论了MapReduce的架构、迭代MapReduce框架和Apache Spark的组件。大多数广泛使用的经典机器学习技术都是使用这些大数据框架以Apache Mahout和Spark MLlib库的形式实现的，这些需要增强以支持所有现有的机器学习技术，如形式概念分析(FCA)和神经嵌入。在本章中，作者将FCA作为一个应用，并使用MapReduce和Spark等大数据处理框架提供了可扩展的FCA算法。流数据处理框架，如Apache Flink和Apache Storm也进行了研究。在处理大型大数据应用时，作者还详细讨论了Hadoop分布式文件系统(HDFS)、Dynamo和Amazon S3等存储架构。调查以与所研究的体系结构和框架相关的最佳实践建议结束。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Big data processing frameworks and architectures: a survey

In recent times, there has been rapid growth in data generated from autonomous sources. The existing data processing techniques are not suitable to deal with these large volumes of complex data that can be structured, semi-structured or unstructured. This large data is referred to as Big data because of its main characteristics: volume, variety velocity, value and veracity. Extensive research on Big data is ongoing, and the primary focus of this research is on processing massive amounts of data effectively and efficiently. However, researchers are paying little attention on how to store and analyze the large volumes of data to get useful insights from it. In this chapter, the authors examine existing Big data processing frameworks like MapReduce, Apache Spark, Storm and Flink. In this chapter, the architectures of MapReduce, iterative MapReduce frameworks and components of Apache Spark are discussed in detail. Most of the widely used classical machine learning techniques are implemented using these Big data frameworks in the form of Apache Mahout and Spark MLlib libraries and these need to be enhanced to support all existing machine learning techniques like formal concept analysis (FCA) and neural embedding. In this chapter, authors have taken FCA as an application and provided scalable FCA algorithms using the Big data processing frameworks like MapReduce and Spark. Streaming data processing frameworks like Apache Flink and Apache Storm is also examined. Authors also discuss about the storage architectures like Hadoop Distributed File System (HDFS), Dynamo and Amazon S3 in detail while processing large Big data applications. The survey concludes with a proposal for best practices related to the studied architectures and frameworks.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Handbook of Big Data Analytics. Volume 1: Methodologies

自引率

0.00%

发文量