MimoSketch: A Framework for Frequency-Based Mining Tasks on Multiple Nodes With Sketches

IF 8.9 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Knowledge and Data Engineering Pub Date : 2024-12-26 DOI:10.1109/TKDE.2024.3523034

Wenfei Wu;Yuchen Xu

{"title":"MimoSketch: A Framework for Frequency-Based Mining Tasks on Multiple Nodes With Sketches","authors":"Wenfei Wu;Yuchen Xu","doi":"10.1109/TKDE.2024.3523034","DOIUrl":null,"url":null,"abstract":"In distributed data stream mining, we abstract a MIMO scenario where a stream of <underline>multiple <underline>items is mined by <underline>multiple n<underline>odes. We design a framework named MimoSketch for the MIMO-specific scenario, which improves the fundamental mining tasks of item frequency estimation, item size distribution estimation, heavy hitter detection, heavy change detection, and entropy estimation. MimoSketch consists of an algorithm design and a policy to schedule items to nodes. MimoSketch's algorithm applies random counting to preserve a mathematically proven <italic>unbiasedness property, which makes it friendly to the aggregate query on multiple nodes; its memory layout is <italic>dynamically adaptive to the runtime item size distribution, which maximizes the estimation accuracy by storing more items. MimoSketch's scheduling policy balances items among nodes, avoiding nodes being overloaded or underloaded, which improves the overall mining accuracy. Our prototype and evaluation show that our algorithm can improve the accuracy of five typical mining tasks by an order of magnitude compared with the state-of-the-art solutions, and the scheduling policy further promotes the performance in MIMO scenarios.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 3","pages":"1311-1324"},"PeriodicalIF":8.9000,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Knowledge and Data Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10816464/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

In distributed data stream mining, we abstract a MIMO scenario where a stream of multiple items is mined by multiple nodes. We design a framework named MimoSketch for the MIMO-specific scenario, which improves the fundamental mining tasks of item frequency estimation, item size distribution estimation, heavy hitter detection, heavy change detection, and entropy estimation. MimoSketch consists of an algorithm design and a policy to schedule items to nodes. MimoSketch's algorithm applies random counting to preserve a mathematically proven unbiasedness property, which makes it friendly to the aggregate query on multiple nodes; its memory layout is dynamically adaptive to the runtime item size distribution, which maximizes the estimation accuracy by storing more items. MimoSketch's scheduling policy balances items among nodes, avoiding nodes being overloaded or underloaded, which improves the overall mining accuracy. Our prototype and evaluation show that our algorithm can improve the accuracy of five typical mining tasks by an order of magnitude compared with the state-of-the-art solutions, and the scheduling policy further promotes the performance in MIMO scenarios.

查看原文本刊更多论文

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Knowledge and Data Engineering 工程技术-工程：电子与电气

CiteScore

11.70

自引率

3.40%

发文量

515

审稿时长

6 months

期刊介绍： The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools, and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.