Understanding the Challenges and Assisting Developers with Developing Spark Applications

2021 IEEE/ACM 43rd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). Pub Date: 2021-03-25. DOI: 10.1109/ICSE-Companion52605.2021.00057

Zehao Wang

Citations: 2

Abstract

To process data more efficiently, big data frameworks provide data abstractions to developers. However, these abstractions can make it difficult for developers to understand and debug their data processing code. To uncover the challenges in using big data frameworks, we first conduct an empirical study of 1,000 Apache Spark-related questions on Stack Overflow. We find that most of the challenges relate to data transformation and API usage. To address these challenges, we design an approach that assists developers in understanding and debugging data processing in Spark. Our approach leverages statistical sampling to minimize performance overhead, and provides intermediate information and hint messages for each data processing step of a chained method pipeline. A preliminary evaluation shows that our approach has low performance overhead, and we received positive feedback from developers.
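
The abstract does not describe how the tool is implemented, so the following is only a minimal sketch of the general idea it names: run each step of a chained pipeline on a small random sample and report the intermediate records, with a hint when a step looks suspicious. It uses plain Python rather than Spark, and every name in it (`trace_pipeline`, `sample_size`, the step tuples, the hint wording) is a hypothetical choice made for illustration, not part of the authors' approach or of any Spark API.

```python
import random
from typing import Any, Callable, List, Optional, Sequence, Tuple

# Each step is (kind, label, function); kind is "map" or "filter".
Step = Tuple[str, str, Callable[[Any], Any]]
# Each report entry is (label, sampled records after the step, optional hint).
Entry = Tuple[str, List[Any], Optional[str]]


def trace_pipeline(data: Sequence[Any], steps: Sequence[Step],
                   sample_size: int = 5, seed: int = 0) -> List[Entry]:
    """Apply the steps to a small random sample and record every intermediate result."""
    rng = random.Random(seed)
    records = list(data)
    # Sampling keeps the tracing cost small compared with processing all records.
    sample = rng.sample(records, min(sample_size, len(records)))
    report: List[Entry] = [("input", list(sample), None)]
    for kind, label, fn in steps:
        if kind == "map":
            out = [fn(x) for x in sample]
        elif kind == "filter":
            out = [x for x in sample if fn(x)]
        else:
            raise ValueError(f"unsupported step kind: {kind!r}")
        # Hypothetical hint: flag a step that removes every sampled record.
        hint = None
        if sample and not out:
            hint = f"'{label}' removed every sampled record"
        report.append((label, out, hint))
        sample = out
    return report


if __name__ == "__main__":
    demo_steps: List[Step] = [
        ("map", "double", lambda x: x * 2),
        ("filter", "keep > 1000", lambda x: x > 1000),
    ]
    for label, records, hint in trace_pipeline(range(1, 101), demo_steps):
        print(label, records, hint or "")
```

In this sketch the report gives one entry per step, so a developer can see where in the chain the sampled records change shape or disappear. A Spark-based version would have to decide where sampling happens and how to handle steps that need the full dataset; those choices are left open here.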
