WavingSketch: an unbiased and generic sketch for finding top-k items in data streams

Zirui Liu, Fenghao Dong, Chengwu Liu, Xiangwei Deng, Tong Yang, Yikai Zhao, Jizhou Li, Bin Cui, Gong Zhang
The VLDB Journal, published 2024-07-29. DOI: 10.1007/s00778-024-00869-6 (https://doi.org/10.1007/s00778-024-00869-6)

Abstract

Finding top-k items in data streams is a fundamental problem in data mining. Unbiased estimation is widely acknowledged as an elegant and important property for top-k algorithms. In this paper, we propose a novel sketch algorithm, called WavingSketch, which is more accurate than existing unbiased algorithms. We theoretically prove that WavingSketch provides unbiased estimation, and we derive its error bound. WavingSketch is generic to measurement tasks, and we apply it to five applications: finding top-k frequent items, finding top-k heavy changes, finding top-k persistent items, finding top-k Super-Spreaders, and join-aggregate estimation. Our experimental results show that, compared with the state-of-the-art Unbiased Space-Saving, WavingSketch achieves \(10\times\) faster speed and \(10^3\times\) smaller error on finding frequent items. For the other applications, WavingSketch also achieves higher accuracy and faster speed. All related code is open-sourced on GitHub (https://github.com/WavingSketch/Waving-Sketch).
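
To make the notion of unbiased estimation concrete, the following minimal sketch may help. It is not the WavingSketch algorithm, whose details are not given in this abstract. It is a generic single signed counter in the style of Count Sketch, written here purely as an illustration. The names `signed_counter_estimate` and `mean_estimate`, the toy stream, and the trial count are all assumptions made for this example. Each distinct item gets a random sign in {-1, +1}, the counter accumulates those signs, and an item's estimate is the counter multiplied by that item's sign. Any single estimate can be far off, but the expectation over the random signs equals the true frequency, so the average over many independent trials should approach the exact count.

```python
import random
from collections import Counter


def signed_counter_estimate(stream, item, seed):
    """Estimate the frequency of `item` from one signed counter.

    Illustrative only: a generic Count-Sketch-style estimator,
    not the data structure proposed in the paper.
    """
    rng = random.Random(seed)
    signs = {}  # random sign per distinct item, drawn lazily

    def sign(e):
        if e not in signs:
            signs[e] = rng.choice((-1, 1))
        return signs[e]

    counter = 0
    for e in stream:
        counter += sign(e)  # every occurrence adds the item's sign
    return counter * sign(item)


def mean_estimate(stream, item, trials=2000):
    """Average the estimate over `trials` independent sign assignments."""
    total = sum(signed_counter_estimate(stream, item, seed)
                for seed in range(trials))
    return total / trials


# Hypothetical toy stream: "a" x50, "b" x30, "c" x20, in shuffled order.
stream = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
random.Random(0).shuffle(stream)

exact = Counter(stream)
top2 = exact.most_common(2)  # exact top-2 frequent items, for reference

if __name__ == "__main__":
    print("exact top-2:", top2)
    for item in ("a", "b", "c"):
        print(item, "exact:", exact[item],
              "mean estimate:", mean_estimate(stream, item))
```

A single signed counter has high variance because all items collide in it. Practical sketches reduce that variance with many counters and extra structure, which is the design space the paper works in.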

