Zirui Liu, Fenghao Dong, Chengwu Liu, Xiangwei Deng, Tong Yang, Yikai Zhao, Jizhou Li, Bin Cui, Gong Zhang
WavingSketch: an unbiased and generic sketch for finding top-k items in data streams
The VLDB Journal, published 2024-07-29
DOI: 10.1007/s00778-024-00869-6 (https://doi.org/10.1007/s00778-024-00869-6)
Abstract
Finding top-k items in data streams is a fundamental problem in data mining. Unbiased estimation is widely acknowledged as an elegant and important property for top-k algorithms. In this paper, we propose a novel sketch algorithm, called WavingSketch, which is more accurate than existing unbiased algorithms. We theoretically prove that WavingSketch provides unbiased estimation, and we derive its error bound. WavingSketch is generic across measurement tasks, and we apply it to five applications: finding top-k frequent items, finding top-k heavy changes, finding top-k persistent items, finding top-k Super-Spreaders, and join-aggregate estimation. Our experimental results show that, compared with the state-of-the-art Unbiased Space-Saving, WavingSketch is \(10 \times \) faster and has \(10^3 \times \) smaller error when finding frequent items. For the other applications, WavingSketch also achieves higher accuracy and faster speed. All related code is open-sourced on GitHub (https://github.com/WavingSketch/Waving-Sketch).
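The abstract does not describe WavingSketch's internals, so the following is not the authors' algorithm. It is only a minimal sketch of what "unbiased estimation" of item frequencies can mean, assuming a single shared counter with a random +1/-1 sign per item (the sign-hash idea used by Count Sketch, swapped in here purely for illustration). All function names are hypothetical. Instead of sampling signs at random, the code averages over every possible sign assignment, so the computed expectation is exact.

```python
from fractions import Fraction
from itertools import product


def signed_counter_estimates(stream, signs):
    """Add each arriving item's sign (+1 or -1) to one shared counter,
    then estimate every item's frequency as counter * sign(item)."""
    counter = 0
    for item in stream:
        counter += signs[item]
    return {item: counter * sign for item, sign in signs.items()}


def expected_estimates(stream):
    """Exact expectation of the estimates, taken over all sign assignments.

    Assumption for this illustration: signs are independent and uniform,
    so every assignment is equally likely.
    """
    items = sorted(set(stream))
    assignments = list(product((+1, -1), repeat=len(items)))
    totals = {item: 0 for item in items}
    for assignment in assignments:
        signs = dict(zip(items, assignment))
        for item, estimate in signed_counter_estimates(stream, signs).items():
            totals[item] += estimate
    return {item: Fraction(total, len(assignments))
            for item, total in totals.items()}


stream = ["a", "b", "a", "c", "a", "b"]
expected = expected_estimates(stream)
for item in sorted(expected):
    print(item, expected[item])
```

Any single sign assignment can over- or under-estimate an item, because other items' contributions leak into the shared counter. Averaged over all assignments those cross terms cancel, so the expected estimate equals the true frequency. That is the property "unbiased" refers to. Accuracy in practice then depends on the variance of the estimator, which is what an error bound such as the one the paper derives would characterize.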