Rui Wang, Yongkun Li, Yinlong Xu, Hong Xie, John C. S. Lui, Shuibing He
{"title":"Toward Fast and Scalable Random Walks over Disk-Resident Graphs via Efficient I/O Management","authors":"Rui Wang, Yongkun Li, Yinlong Xu, Hong Xie, John C. S. Lui, Shuibing He","doi":"https://dl.acm.org/doi/10.1145/3533579","DOIUrl":null,"url":null,"abstract":"<p>Traditional graph systems mainly use the iteration-based model, which iteratively loads graph blocks into memory for analysis so as to reduce random I/Os. However, this iteration-based model limits the efficiency and scalability of running random walk, which is a fundamental technique to analyze large graphs. In this article, we first propose a state-aware I/O model to improve the I/O efficiency of running random walk, then we develop a block-centric indexing and buffering scheme for managing walk data, and leverage an asynchronous walk updating strategy to improve random walk efficiency. We implement an I/O-efficient graph system, <sans-serif>GraphWalker</sans-serif>, which is efficient to handle very large disk-resident graphs and also scalable to run tens of billions of random walks with only a single commodity machine. Experiments show that <sans-serif>GraphWalker</sans-serif> can achieve more than an order of magnitude speedup when compared with DrunkardMob, which is tailored for random walks based on the classical graph system GraphChi, as well as two state-of-the-art single-machine graph systems, Graphene and GraFSoft. Furthermore, when compared with the most recent distributed system KnightKing, <sans-serif>GraphWalker</sans-serif> still achieves comparable performance with only a single machine, thereby making it a more cost-effective alternative.</p>","PeriodicalId":49113,"journal":{"name":"ACM Transactions on Storage","volume":"68 7","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2022-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Storage","FirstCategoryId":"94","ListUrlMain":"https://doi.org/https://dl.acm.org/doi/10.1145/3533579","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
引用次数: 0
Abstract
Traditional graph systems mainly use the iteration-based model, which iteratively loads graph blocks into memory for analysis so as to reduce random I/Os. However, this iteration-based model limits the efficiency and scalability of running random walk, which is a fundamental technique to analyze large graphs. In this article, we first propose a state-aware I/O model to improve the I/O efficiency of running random walk, then we develop a block-centric indexing and buffering scheme for managing walk data, and leverage an asynchronous walk updating strategy to improve random walk efficiency. We implement an I/O-efficient graph system, GraphWalker, which is efficient to handle very large disk-resident graphs and also scalable to run tens of billions of random walks with only a single commodity machine. Experiments show that GraphWalker can achieve more than an order of magnitude speedup when compared with DrunkardMob, which is tailored for random walks based on the classical graph system GraphChi, as well as two state-of-the-art single-machine graph systems, Graphene and GraFSoft. Furthermore, when compared with the most recent distributed system KnightKing, GraphWalker still achieves comparable performance with only a single machine, thereby making it a more cost-effective alternative.
期刊介绍:
The ACM Transactions on Storage (TOS) is a new journal with an intent to publish original archival papers in the area of storage and closely related disciplines. Articles that appear in TOS will tend either to present new techniques and concepts or to report novel experiences and experiments with practical systems. Storage is a broad and multidisciplinary area that comprises of network protocols, resource management, data backup, replication, recovery, devices, security, and theory of data coding, densities, and low-power. Potential synergies among these fields are expected to open up new research directions.