DHive: Query Execution Performance Analysis via Dataflow in Apache Hive

Impact factor: 3.3; CAS Zone 3 (Computer Science); Q2 (Computer Science, Information Systems)

Proceedings of the VLDB Endowment. Pub Date: 2023-08-01. DOI: 10.14778/3611540.3611605

Chaozu Zhang, Qiaomu Shen, Bo Tang


Abstract

Apache Hive is widely used for large-scale data analysis in many organizations. Various visual analytics tools have been developed to help Hive users quickly analyze the query execution process and identify the performance bottlenecks of executed queries. However, existing tools mostly show the time usage of query sub-components (jobs and operators) and fail to provide enough evidence to identify the root causes of slow execution. To tackle this problem, we develop DHive, a visual analytics system that visualizes and analyzes query execution progress via dataflow analysis. DHive shows the dataflow during query execution at multiple levels (query level, job level, and task level), which enables users to identify the key jobs and tasks and explain their time usage by linking them to auxiliary information such as the system configuration and hardware status. We demonstrate the effectiveness of DHive through two cases in a production cluster. DHive is open source at https://github.com/DBGroup-SUSTech/DHive.git.
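
To make the multi-level idea concrete, here is a minimal Python sketch of how task-level dataflow measurements might be rolled up to the job level and then to the query level. It is an illustration only and is not taken from DHive: the `TaskRecord` fields and the functions `summarize_jobs` and `summarize_query` are hypothetical names, and picking the "key" job by its slowest task is an assumed heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRecord:
    """Hypothetical per-task measurement; not DHive's actual schema."""
    job_id: str
    task_id: str
    bytes_read: int
    bytes_written: int
    seconds: float


def summarize_jobs(tasks):
    """Task level -> job level: total the dataflow and track the slowest task."""
    jobs = {}
    for task in tasks:
        job = jobs.setdefault(
            task.job_id,
            {"bytes_read": 0, "bytes_written": 0, "slowest_task": task},
        )
        job["bytes_read"] += task.bytes_read
        job["bytes_written"] += task.bytes_written
        if task.seconds > job["slowest_task"].seconds:
            job["slowest_task"] = task
    return jobs


def summarize_query(jobs):
    """Job level -> query level: total the dataflow and pick a key job.

    The totals include intermediate data passed between jobs. The key job is
    chosen here as the one containing the slowest task (an assumed heuristic).
    """
    return {
        "bytes_read": sum(job["bytes_read"] for job in jobs.values()),
        "bytes_written": sum(job["bytes_written"] for job in jobs.values()),
        "key_job": max(jobs, key=lambda jid: jobs[jid]["slowest_task"].seconds),
    }


if __name__ == "__main__":
    sample = [
        TaskRecord("job1", "t1", 100, 50, 2.0),
        TaskRecord("job1", "t2", 200, 80, 9.5),
        TaskRecord("job2", "t1", 130, 10, 1.0),
    ]
    per_job = summarize_jobs(sample)
    print(per_job["job1"]["slowest_task"].task_id)
    print(summarize_query(per_job))
```

In a drill-down workflow, a user would start from the query-level summary, follow `key_job` to the job-level entry, and then inspect its `slowest_task` alongside configuration and hardware information.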


Source journal

Proceedings of the VLDB Endowment (Computer Science: General Computer Science)

CiteScore: 7.70

Self-citation rate: 0.00%

Journal description: The Proceedings of the VLDB Endowment (PVLDB) welcomes original research papers on a broad range of research topics related to all aspects of data management, where systems issues play a significant role, such as data management system technology and information management infrastructures, including their very large scale of experimentation, novel architectures, and demanding applications, as well as their underpinning theory. The scope of a submission for PVLDB is also described by the subject areas given below. Moreover, the scope of PVLDB is restricted to scientific areas that are covered by the combined expertise on the submission’s topic of the journal’s editorial board. Finally, the submission’s contributions should build on work already published in data management outlets, e.g., PVLDB, VLDBJ, ACM SIGMOD, IEEE ICDE, EDBT, ACM TODS, IEEE TKDE, and go beyond a syntactic citation.