Spark生态系统综述:大数据处理基础设施、机器学习与应用(扩展摘要)

2023 IEEE 39th International Conference on Data Engineering (ICDE) Pub Date : 2023-04-01 DOI:10.1109/ICDE55515.2023.00316

Shanjian Tang, Bin He, Ce Yu, Yusen Li, Kun Li

{"title":"Spark生态系统综述:大数据处理基础设施、机器学习与应用(扩展摘要)","authors":"Shanjian Tang, Bin He, Ce Yu, Yusen Li, Kun Li","doi":"10.1109/ICDE55515.2023.00316","DOIUrl":null,"url":null,"abstract":"With the explosive increase of big data in industry and academic fields, it is important to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is the state-of-the-art in large-scale data computing systems nowadays, due to its good properties including generality, fault tolerance, high performance of in-memory data processing, and scalability. Spark adopts a flexible Resident Distributed Dataset (RDD) programming model with a set of provided transformation and action operators whose operating functions can be customized by users according to their applications. It is originally positioned as a fast and general data processing system. A large body of research efforts have been made to make it more efficient (faster) and general by considering various circumstances since its introduction. In this survey, we aim to have a thorough review of various kinds of optimization techniques on the generality and performance improvement of Spark. We introduce various data management and processing systems, machine learning algorithms and applications supported by Spark. Additionally, we make a discussion on the open issues and challenges for large-scale in-memory data processing with Spark.","PeriodicalId":434744,"journal":{"name":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Survey on Spark Ecosystem: Big Data Processing Infrastructure, Machine Learning, and Applications (Extended abstract)\",\"authors\":\"Shanjian Tang, Bin He, Ce Yu, Yusen Li, Kun Li\",\"doi\":\"10.1109/ICDE55515.2023.00316\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the explosive increase of big data in industry and academic fields, it is important to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is the state-of-the-art in large-scale data computing systems nowadays, due to its good properties including generality, fault tolerance, high performance of in-memory data processing, and scalability. Spark adopts a flexible Resident Distributed Dataset (RDD) programming model with a set of provided transformation and action operators whose operating functions can be customized by users according to their applications. It is originally positioned as a fast and general data processing system. A large body of research efforts have been made to make it more efficient (faster) and general by considering various circumstances since its introduction. In this survey, we aim to have a thorough review of various kinds of optimization techniques on the generality and performance improvement of Spark. We introduce various data management and processing systems, machine learning algorithms and applications supported by Spark. Additionally, we make a discussion on the open issues and challenges for large-scale in-memory data processing with Spark.\",\"PeriodicalId\":434744,\"journal\":{\"name\":\"2023 IEEE 39th International Conference on Data Engineering (ICDE)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE 39th International Conference on Data Engineering (ICDE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDE55515.2023.00316\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE 39th International Conference on Data Engineering (ICDE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE55515.2023.00316","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

随着大数据在工业和学术领域的爆炸式增长，应用大规模的数据处理系统来分析大数据变得非常重要。可以说，Spark是当今大规模数据计算系统中最先进的技术，因为它具有良好的特性，包括通用性、容错性、高性能的内存数据处理和可伸缩性。Spark采用灵活的RDD (Resident Distributed Dataset)编程模型，提供了一组转换和操作算子，用户可以根据自己的应用定制操作函数。它最初定位为一个快速和通用的数据处理系统。自引入以来，考虑到各种情况，已经进行了大量的研究工作，以使其更有效(更快)和更普遍。在这次调查中，我们的目标是对Spark的通用性和性能改进的各种优化技术进行全面的回顾。我们介绍了各种数据管理和处理系统，机器学习算法和Spark支持的应用程序。此外，我们还讨论了使用Spark进行大规模内存数据处理的开放问题和挑战。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Survey on Spark Ecosystem: Big Data Processing Infrastructure, Machine Learning, and Applications (Extended abstract)

With the explosive increase of big data in industry and academic fields, it is important to apply large-scale data processing systems to analyze Big Data. Arguably, Spark is the state-of-the-art in large-scale data computing systems nowadays, due to its good properties including generality, fault tolerance, high performance of in-memory data processing, and scalability. Spark adopts a flexible Resident Distributed Dataset (RDD) programming model with a set of provided transformation and action operators whose operating functions can be customized by users according to their applications. It is originally positioned as a fast and general data processing system. A large body of research efforts have been made to make it more efficient (faster) and general by considering various circumstances since its introduction. In this survey, we aim to have a thorough review of various kinds of optimization techniques on the generality and performance improvement of Spark. We introduce various data management and processing systems, machine learning algorithms and applications supported by Spark. Additionally, we make a discussion on the open issues and challenges for large-scale in-memory data processing with Spark.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2023 IEEE 39th International Conference on Data Engineering (ICDE)

自引率

0.00%

发文量