Fast Natural Language Based Data Exploration with Samples

Companion of the 2023 International Conference on Management of Data. Pub Date: 2023-06-04. DOI: 10.1145/3555041.3589724

Shubham Agarwal, G. Chan, Shaddy Garg, Tong Yu, Subrata Mitra

{"title":"Fast Natural Language Based Data Exploration with Samples","authors":"Shubham Agarwal, G. Chan, Shaddy Garg, Tong Yu, Subrata Mitra","doi":"10.1145/3555041.3589724","DOIUrl":null,"url":null,"abstract":"The ability to extract insights from large amounts of data in a timely manner is a crucial problem. Exploratory Data Analysis (EDA) is commonly used by analysts to uncover insights using a sequence of SQL commands and associated visualizations. However, in many cases, this process is carried out by non-programmers who must work within tight time constraints, such as in a marketing campaign where a marketer must quickly analyse large amounts of data to reach a target revenue. This paper presents ApproxEDA - a system that combines a natural language processing (NLP) interface for insight discovery with an underlying sample-based EDA engine. The NLP interface can convert high-level questions into contextual SQL queries of the dataset, while the backend EDA engine significantly speeds up insight discovery by selecting the most optimum sample from among many pre-created samples using various sampling strategies. We demonstrate that ApproxEDA addresses two key aspects: converting high-level NLP inputs to contextual SQL and intelligently selecting samples using a reinforcement learning agent. 
This protects users from diverging from their original intent of analysis, which can occur due to approximation errors in results and visualizations, while still providing optimal latency reduction through the use of samples.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion of the 2023 International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3555041.3589724","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Extracting insights from large amounts of data in a timely manner is a crucial problem. Analysts commonly use Exploratory Data Analysis (EDA) to uncover insights through a sequence of SQL commands and associated visualizations. In many cases, however, this process is carried out by non-programmers who must work within tight time constraints, such as a marketer who must quickly analyse large amounts of data during a campaign to reach a target revenue. This paper presents ApproxEDA, a system that combines a natural language processing (NLP) interface for insight discovery with an underlying sample-based EDA engine. The NLP interface converts high-level questions into contextual SQL queries over the dataset, while the backend EDA engine significantly speeds up insight discovery by selecting the most suitable sample from among many samples pre-created with various sampling strategies. We demonstrate that ApproxEDA addresses two key aspects: converting high-level natural language inputs to contextual SQL, and intelligently selecting samples using a reinforcement learning agent. This keeps users from diverging from their original intent of analysis, which can happen because of approximation errors in results and visualizations, while still providing optimal latency reduction through the use of samples.
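
The sample-selection idea in the abstract can be illustrated with a small sketch. It is not ApproxEDA's implementation: the abstract gives no details of the agent, so an epsilon-greedy bandit stands in here for the reinforcement learning agent, and the synthetic data, the reward shape, the `error_weight` value, and all function names are assumptions made only for illustration. The sketch pre-creates uniform samples at several rates, then learns which rate best trades latency saved against approximation error over a stream of "average revenue per region" queries.

```python
import random
import statistics

def make_samples(rows, rates, seed=0):
    """Pre-create one uniform random sample of the rows per sampling rate."""
    rng = random.Random(seed)
    return {r: rng.sample(rows, max(1, int(len(rows) * r))) for r in rates}

def avg_revenue(rows, region):
    """Stand-in for an EDA query: average revenue within one region."""
    values = [rev for reg, rev in rows if reg == region]
    return statistics.fmean(values) if values else None

def reward(sample, rows, region, error_weight=5.0):
    """Assumed reward: latency saved by the smaller sample minus an error penalty."""
    exact = avg_revenue(rows, region)    # assumed non-zero for this synthetic data
    approx = avg_revenue(sample, region)
    if approx is None:                   # the sample missed the region entirely
        return -error_weight
    rel_error = abs(approx - exact) / abs(exact)
    return (1.0 - len(sample) / len(rows)) - error_weight * rel_error

def select_rate(rows, samples, regions, rounds=300, epsilon=0.1, seed=1):
    """Epsilon-greedy bandit over sampling rates (a simple stand-in for an RL agent)."""
    rng = random.Random(seed)
    rates = list(samples)
    value = {r: 0.0 for r in rates}      # running mean reward per rate
    pulls = {r: 0 for r in rates}
    for _ in range(rounds):
        region = rng.choice(regions)     # the next query in the exploration session
        if rng.random() < epsilon:
            rate = rng.choice(rates)             # explore
        else:
            rate = max(rates, key=value.get)     # exploit the best rate so far
        pulls[rate] += 1
        observed = reward(samples[rate], rows, region)
        value[rate] += (observed - value[rate]) / pulls[rate]
    return max(rates, key=value.get), value

# Synthetic (region, revenue) rows; purely illustrative data.
data_rng = random.Random(42)
regions = ["north", "south", "west"]
rows = [(data_rng.choice(regions), data_rng.uniform(10.0, 100.0)) for _ in range(5000)]

samples = make_samples(rows, rates=[0.01, 0.1, 0.5])
best_rate, value = select_rate(rows, samples, regions)
print("chosen sampling rate:", best_rate)
```

Raising `error_weight` in this sketch pushes the choice toward larger samples, and lowering it favours smaller, faster ones. A real system would measure query latency directly and could not compare against the exact answer on every query, so the reward here should be read only as a toy proxy for the latency-versus-accuracy trade-off the abstract describes.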
