敏捷分析的敏捷样本维护方法

2020 IEEE 36th International Conference on Data Engineering (ICDE) Pub Date : 2020-04-01 DOI:10.1109/ICDE48307.2020.00071

Hanbing Zhang, Yazhong Zhang, Zhenying He, Yinan Jing, Kai Zhang, X. S. Wang

{"title":"敏捷分析的敏捷样本维护方法","authors":"Hanbing Zhang, Yazhong Zhang, Zhenying He, Yinan Jing, Kai Zhang, X. S. Wang","doi":"10.1109/ICDE48307.2020.00071","DOIUrl":null,"url":null,"abstract":"Agile analytics can help organizations to gain and sustain a competitive advantage by making timely decisions. Approximate query processing (AQP) is one of the useful approaches in agile analytics, which facilitates fast queries on big data by leveraging a pre-computed sample. One problem such a sample faces is that when new data is being imported, re-sampling is most likely needed to keep the sample fresh and AQP results accurate enough. Re-sampling from scratch for every batch of new data, called the full re-sampling method and adopted by many existing AQP works, is obviously a very costly process, and a much quicker incremental sampling process, such as reservoir sampling, may be used to cover the newly arrived data. However, incremental update methods suffer from the fact that the sample size cannot be increased, which is a problem when the underlying data distribution dramatically changes and the sample needs to be enlarged to maintain the AQP accuracy. This paper proposes an adaptive sample update (ASU) approach that avoids re-sampling from scratch as much as possible by monitoring the data distribution, and uses instead an incremental update method before a re-sampling becomes necessary. The paper also proposes an enhanced approach (T-ASU), which tries to enlarge the sample size without re-sampling from scratch when a bit of query inaccuracy is tolerable to further reduce the sample update cost. These two approaches are integrated into a state-of-the-art AQP engine for an extensive experimental study. Experimental results on both real-world and synthetic datasets show that the two approaches are faster than the full re-sampling method while achieving almost the same AQP accuracy when the underlying data distribution continuously changes.","PeriodicalId":6709,"journal":{"name":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","volume":"117 1","pages":"757-768"},"PeriodicalIF":0.0000,"publicationDate":"2020-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An Agile Sample Maintenance Approach for Agile Analytics\",\"authors\":\"Hanbing Zhang, Yazhong Zhang, Zhenying He, Yinan Jing, Kai Zhang, X. S. Wang\",\"doi\":\"10.1109/ICDE48307.2020.00071\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Agile analytics can help organizations to gain and sustain a competitive advantage by making timely decisions. Approximate query processing (AQP) is one of the useful approaches in agile analytics, which facilitates fast queries on big data by leveraging a pre-computed sample. One problem such a sample faces is that when new data is being imported, re-sampling is most likely needed to keep the sample fresh and AQP results accurate enough. Re-sampling from scratch for every batch of new data, called the full re-sampling method and adopted by many existing AQP works, is obviously a very costly process, and a much quicker incremental sampling process, such as reservoir sampling, may be used to cover the newly arrived data. However, incremental update methods suffer from the fact that the sample size cannot be increased, which is a problem when the underlying data distribution dramatically changes and the sample needs to be enlarged to maintain the AQP accuracy. This paper proposes an adaptive sample update (ASU) approach that avoids re-sampling from scratch as much as possible by monitoring the data distribution, and uses instead an incremental update method before a re-sampling becomes necessary. The paper also proposes an enhanced approach (T-ASU), which tries to enlarge the sample size without re-sampling from scratch when a bit of query inaccuracy is tolerable to further reduce the sample update cost. These two approaches are integrated into a state-of-the-art AQP engine for an extensive experimental study. Experimental results on both real-world and synthetic datasets show that the two approaches are faster than the full re-sampling method while achieving almost the same AQP accuracy when the underlying data distribution continuously changes.\",\"PeriodicalId\":6709,\"journal\":{\"name\":\"2020 IEEE 36th International Conference on Data Engineering (ICDE)\",\"volume\":\"117 1\",\"pages\":\"757-768\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 36th International Conference on Data Engineering (ICDE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDE48307.2020.00071\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 36th International Conference on Data Engineering (ICDE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE48307.2020.00071","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

敏捷分析可以帮助组织通过做出及时的决策来获得并维持竞争优势。近似查询处理(AQP)是敏捷分析中的一种有用方法，它通过利用预先计算的样本来促进对大数据的快速查询。这种样本面临的一个问题是，当导入新数据时，很可能需要重新采样，以保持样本的新鲜度和AQP结果足够准确。对每一批新数据从头开始重新采样，称为全重采样方法，许多现有AQP工作采用这种方法，显然是一个非常昂贵的过程，可以使用更快的增量采样过程，例如油藏采样，来覆盖新到达的数据。然而，增量更新方法的缺点是不能增加样本量，这是一个问题，当底层数据分布发生巨大变化，需要扩大样本以保持AQP精度时。本文提出了一种自适应样本更新(ASU)方法，通过监测数据分布尽可能避免从头开始重新采样，并在需要重新采样之前使用增量更新方法。本文还提出了一种增强方法(T-ASU)，该方法在可以容忍少量查询不准确的情况下，尝试在不重新采样的情况下扩大样本大小，以进一步降低样本更新成本。这两种方法被整合到最先进的AQP引擎中进行广泛的实验研究。在真实数据集和合成数据集上的实验结果表明，当底层数据分布连续变化时，这两种方法都比完全重采样方法更快，同时获得几乎相同的AQP精度。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

An Agile Sample Maintenance Approach for Agile Analytics

Agile analytics can help organizations to gain and sustain a competitive advantage by making timely decisions. Approximate query processing (AQP) is one of the useful approaches in agile analytics, which facilitates fast queries on big data by leveraging a pre-computed sample. One problem such a sample faces is that when new data is being imported, re-sampling is most likely needed to keep the sample fresh and AQP results accurate enough. Re-sampling from scratch for every batch of new data, called the full re-sampling method and adopted by many existing AQP works, is obviously a very costly process, and a much quicker incremental sampling process, such as reservoir sampling, may be used to cover the newly arrived data. However, incremental update methods suffer from the fact that the sample size cannot be increased, which is a problem when the underlying data distribution dramatically changes and the sample needs to be enlarged to maintain the AQP accuracy. This paper proposes an adaptive sample update (ASU) approach that avoids re-sampling from scratch as much as possible by monitoring the data distribution, and uses instead an incremental update method before a re-sampling becomes necessary. The paper also proposes an enhanced approach (T-ASU), which tries to enlarge the sample size without re-sampling from scratch when a bit of query inaccuracy is tolerable to further reduce the sample update cost. These two approaches are integrated into a state-of-the-art AQP engine for an extensive experimental study. Experimental results on both real-world and synthetic datasets show that the two approaches are faster than the full re-sampling method while achieving almost the same AQP accuracy when the underlying data distribution continuously changes.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2020 IEEE 36th International Conference on Data Engineering (ICDE)

自引率

0.00%

发文量