支持大数据分析的分布式计算框架综述

IF 6.2 1区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Big Data Mining and Analytics Pub Date : 2023-01-26 DOI:10.26599/BDMA.2022.9020014

Xudong Sun;Yulin He;Dingming Wu;Joshua Zhexue Huang

{"title":"支持大数据分析的分布式计算框架综述","authors":"Xudong Sun;Yulin He;Dingming Wu;Joshua Zhexue Huang","doi":"10.26599/BDMA.2022.9020014","DOIUrl":null,"url":null,"abstract":"Distributed computing frameworks are the fundamental component of distributed computing systems. They provide an essential way to support the efficient processing of big data on clusters or cloud. The size of big data increases at a pace that is faster than the increase in the big data processing capacity of clusters. Thus, distributed computing frameworks based on the MapReduce computing model are not adequate to support big data analysis tasks which often require running complex analytical algorithms on extremely big data sets in terabytes. In performing such tasks, these frameworks face three challenges: computational inefficiency due to high I/O and communication costs, non-scalability to big data due to memory limit, and limited analytical algorithms because many serial algorithms cannot be implemented in the MapReduce programming model. New distributed computing frameworks need to be developed to conquer these challenges. In this paper, we review MapReduce-type distributed computing frameworks that are currently used in handling big data and discuss their problems when conducting big data analysis. In addition, we present a non-MapReduce distributed computing framework that has the potential to overcome big data analysis challenges.","PeriodicalId":52355,"journal":{"name":"Big Data Mining and Analytics","volume":"6 2","pages":"154-169"},"PeriodicalIF":6.2000,"publicationDate":"2023-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026506.pdf","citationCount":"6","resultStr":"{\"title\":\"Survey of Distributed Computing Frameworks for Supporting Big Data Analysis\",\"authors\":\"Xudong Sun;Yulin He;Dingming Wu;Joshua Zhexue Huang\",\"doi\":\"10.26599/BDMA.2022.9020014\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Distributed computing frameworks are the fundamental component of distributed computing systems. They provide an essential way to support the efficient processing of big data on clusters or cloud. The size of big data increases at a pace that is faster than the increase in the big data processing capacity of clusters. Thus, distributed computing frameworks based on the MapReduce computing model are not adequate to support big data analysis tasks which often require running complex analytical algorithms on extremely big data sets in terabytes. In performing such tasks, these frameworks face three challenges: computational inefficiency due to high I/O and communication costs, non-scalability to big data due to memory limit, and limited analytical algorithms because many serial algorithms cannot be implemented in the MapReduce programming model. New distributed computing frameworks need to be developed to conquer these challenges. In this paper, we review MapReduce-type distributed computing frameworks that are currently used in handling big data and discuss their problems when conducting big data analysis. In addition, we present a non-MapReduce distributed computing framework that has the potential to overcome big data analysis challenges.\",\"PeriodicalId\":52355,\"journal\":{\"name\":\"Big Data Mining and Analytics\",\"volume\":\"6 2\",\"pages\":\"154-169\"},\"PeriodicalIF\":6.2000,\"publicationDate\":\"2023-01-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/iel7/8254253/10026288/10026506.pdf\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Big Data Mining and Analytics\",\"FirstCategoryId\":\"1093\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10026506/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Big Data Mining and Analytics","FirstCategoryId":"1093","ListUrlMain":"https://ieeexplore.ieee.org/document/10026506/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 6

摘要

分布式计算框架是分布式计算系统的基本组成部分。它们提供了一种重要的方式来支持在集群或云上高效处理大数据。大数据的规模增长速度快于集群大数据处理能力的增长速度。因此，基于MapReduce计算模型的分布式计算框架不足以支持大数据分析任务，这些任务通常需要在以TB为单位的超大数据集上运行复杂的分析算法。在执行此类任务时，这些框架面临三个挑战：由于I/O和通信成本高，计算效率低下；由于内存限制，无法扩展到大数据；以及由于许多串行算法无法在MapReduce编程模型中实现，分析算法有限。需要开发新的分布式计算框架来克服这些挑战。在本文中，我们回顾了目前用于处理大数据的MapReduce类型的分布式计算框架，并讨论了它们在进行大数据分析时存在的问题。此外，我们提出了一个非MapReduce分布式计算框架，该框架有可能克服大数据分析的挑战。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Survey of Distributed Computing Frameworks for Supporting Big Data Analysis

Distributed computing frameworks are the fundamental component of distributed computing systems. They provide an essential way to support the efficient processing of big data on clusters or cloud. The size of big data increases at a pace that is faster than the increase in the big data processing capacity of clusters. Thus, distributed computing frameworks based on the MapReduce computing model are not adequate to support big data analysis tasks which often require running complex analytical algorithms on extremely big data sets in terabytes. In performing such tasks, these frameworks face three challenges: computational inefficiency due to high I/O and communication costs, non-scalability to big data due to memory limit, and limited analytical algorithms because many serial algorithms cannot be implemented in the MapReduce programming model. New distributed computing frameworks need to be developed to conquer these challenges. In this paper, we review MapReduce-type distributed computing frameworks that are currently used in handling big data and discuss their problems when conducting big data analysis. In addition, we present a non-MapReduce distributed computing framework that has the potential to overcome big data analysis challenges.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Big Data Mining and Analytics Computer Science-Computer Science Applications

CiteScore

20.90

自引率

2.20%

发文量

期刊介绍： Big Data Mining and Analytics, a publication by Tsinghua University Press, presents groundbreaking research in the field of big data research and its applications. This comprehensive book delves into the exploration and analysis of vast amounts of data from diverse sources to uncover hidden patterns, correlations, insights, and knowledge. Featuring the latest developments, research issues, and solutions, this book offers valuable insights into the world of big data. It provides a deep understanding of data mining techniques, data analytics, and their practical applications. Big Data Mining and Analytics has gained significant recognition and is indexed and abstracted in esteemed platforms such as ESCI, EI, Scopus, DBLP Computer Science, Google Scholar, INSPEC, CSCD, DOAJ, CNKI, and more. With its wealth of information and its ability to transform the way we perceive and utilize data, this book is a must-read for researchers, professionals, and anyone interested in the field of big data analytics.