连接并上的抽样

Yurong Liu, Yunlong Xu, F. Nargesian
{"title":"连接并上的抽样","authors":"Yurong Liu, Yunlong Xu, F. Nargesian","doi":"10.1145/3555041.3589400","DOIUrl":null,"url":null,"abstract":"Data scientists often draw on multiple relational data sources for analysis. A standard assumption in learning and approximate query answering is that the data is a uniform and independent sample of the underlying distribution. To avoid the cost of join and union, given a set of joins, we study the problem of obtaining a random sample from the union of joins without performing the full join and union. We present a general framework for random sampling over the set union of chain, acyclic, and cyclic joins, with sample uniformity and independence guarantees. We study the novel problem of union of joins size evaluation and propose two approximation methods based on histograms of columns and random walks on data. We propose an online union sampling framework that initializes with cheap-to-calculate parameter approximations and refines them on the fly during sampling. We evaluate our framework on workloads from the TPC-H benchmark and explore the trade-off of the accuracy of union approximation and sampling efficiency.","PeriodicalId":161812,"journal":{"name":"Companion of the 2023 International Conference on Management of Data","volume":"56 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Sampling over Union of Joins\",\"authors\":\"Yurong Liu, Yunlong Xu, F. Nargesian\",\"doi\":\"10.1145/3555041.3589400\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data scientists often draw on multiple relational data sources for analysis. A standard assumption in learning and approximate query answering is that the data is a uniform and independent sample of the underlying distribution. To avoid the cost of join and union, given a set of joins, we study the problem of obtaining a random sample from the union of joins without performing the full join and union. We present a general framework for random sampling over the set union of chain, acyclic, and cyclic joins, with sample uniformity and independence guarantees. We study the novel problem of union of joins size evaluation and propose two approximation methods based on histograms of columns and random walks on data. We propose an online union sampling framework that initializes with cheap-to-calculate parameter approximations and refines them on the fly during sampling. We evaluate our framework on workloads from the TPC-H benchmark and explore the trade-off of the accuracy of union approximation and sampling efficiency.\",\"PeriodicalId\":161812,\"journal\":{\"name\":\"Companion of the 2023 International Conference on Management of Data\",\"volume\":\"56 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-03-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Companion of the 2023 International Conference on Management of Data\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3555041.3589400\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion of the 2023 International Conference on Management of Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3555041.3589400","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

数据科学家经常利用多个关系数据源进行分析。学习和近似查询回答的一个标准假设是,数据是底层分布的一个统一且独立的样本。为了避免连接和并集的代价,给定一组连接,研究了在不执行完全连接和并集的情况下从连接并集中获得随机样本的问题。给出了链、无环和循环连接集合并上随机抽样的一般框架,保证了抽样的一致性和独立性。研究了连接并大小评估的新问题,提出了两种基于列直方图和数据随机游走的近似方法。我们提出了一个在线联合采样框架,该框架使用易于计算的参数近似进行初始化,并在采样过程中动态地对其进行改进。我们在TPC-H基准的工作负载上评估了我们的框架,并探索了联合逼近的准确性和采样效率之间的权衡。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Sampling over Union of Joins
Data scientists often draw on multiple relational data sources for analysis. A standard assumption in learning and approximate query answering is that the data is a uniform and independent sample of the underlying distribution. To avoid the cost of join and union, given a set of joins, we study the problem of obtaining a random sample from the union of joins without performing the full join and union. We present a general framework for random sampling over the set union of chain, acyclic, and cyclic joins, with sample uniformity and independence guarantees. We study the novel problem of union of joins size evaluation and propose two approximation methods based on histograms of columns and random walks on data. We propose an online union sampling framework that initializes with cheap-to-calculate parameter approximations and refines them on the fly during sampling. We evaluate our framework on workloads from the TPC-H benchmark and explore the trade-off of the accuracy of union approximation and sampling efficiency.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信