Fast Natural Language Based Data Exploration with Samples

Shubham Agarwal, G. Chan, Shaddy Garg, Tong Yu, Subrata Mitra
{"title":"Fast Natural Language Based Data Exploration with Samples","authors":"Shubham Agarwal, G. Chan, Shaddy Garg, Tong Yu, Subrata Mitra","doi":"10.1145/3555041.3589724","DOIUrl":null,"url":null,"abstract":"The ability to extract insights from large amounts of data in a timely manner is a crucial problem. Exploratory Data Analysis (EDA) is commonly used by analysts to uncover insights using a sequence of SQL commands and associated visualizations. However, in many cases, this process is carried out by non-programmers who must work within tight time constraints, such as in a marketing campaign where a marketer must quickly analyse large amounts of data to reach a target revenue. This paper presents ApproxEDA - a system that combines a natural language processing (NLP) interface for insight discovery with an underlying sample-based EDA engine. The NLP interface can convert high-level questions into contextual SQL queries of the dataset, while the backend EDA engine significantly speeds up insight discovery by selecting the most optimum sample from among many pre-created samples using various sampling strategies. We demonstrate that ApproxEDA addresses two key aspects: converting high-level NLP inputs to contextual SQL and intelligently selecting samples using a reinforcement learning agent. This protects users from diverging from their original intent of analysis, which can occur due to approximation errors in results and visualizations, while still providing optimal latency reduction through the use of samples.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"47 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion of the 2023 International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3555041.3589724","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The ability to extract insights from large amounts of data in a timely manner is a crucial problem. Exploratory Data Analysis (EDA) is commonly used by analysts to uncover insights using a sequence of SQL commands and associated visualizations. However, in many cases, this process is carried out by non-programmers who must work within tight time constraints, such as in a marketing campaign where a marketer must quickly analyse large amounts of data to reach a target revenue. This paper presents ApproxEDA - a system that combines a natural language processing (NLP) interface for insight discovery with an underlying sample-based EDA engine. The NLP interface can convert high-level questions into contextual SQL queries of the dataset, while the backend EDA engine significantly speeds up insight discovery by selecting the most optimum sample from among many pre-created samples using various sampling strategies. We demonstrate that ApproxEDA addresses two key aspects: converting high-level NLP inputs to contextual SQL and intelligently selecting samples using a reinforcement learning agent. This protects users from diverging from their original intent of analysis, which can occur due to approximation errors in results and visualizations, while still providing optimal latency reduction through the use of samples.
基于样本的快速自然语言数据探索
从大量数据中及时提取见解的能力是一个关键问题。探索性数据分析(EDA)通常用于分析人员使用一系列SQL命令和相关的可视化来揭示见解。然而,在许多情况下,这个过程是由非程序员执行的,他们必须在紧迫的时间限制内工作,例如在营销活动中,营销人员必须快速分析大量数据以达到目标收益。本文介绍了ApproxEDA——一个结合了用于洞察发现的自然语言处理(NLP)接口和底层基于样本的EDA引擎的系统。NLP接口可以将高级问题转换为数据集的上下文SQL查询,而后端EDA引擎通过使用各种采样策略从许多预先创建的样本中选择最优样本,大大加快了洞察发现。我们证明了ApproxEDA解决了两个关键方面:将高级NLP输入转换为上下文SQL,并使用强化学习代理智能地选择样本。这可以防止用户偏离其分析的原始意图,这可能是由于结果和可视化中的近似误差造成的,同时仍然通过使用样本提供最佳的延迟减少。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信