Cache-conscious graph collaborative filtering on multi-socket multicore systems

Lifeng Nai, Yinglong Xia, Ching-Yung Lin, Bo Hong, H. Lee
{"title":"Cache-conscious graph collaborative filtering on multi-socket multicore systems","authors":"Lifeng Nai, Yinglong Xia, Ching-Yung Lin, Bo Hong, H. Lee","doi":"10.1145/2597917.2597935","DOIUrl":null,"url":null,"abstract":"Recommendation systems using graph collaborative filtering often require responses in real time and high throughput. Therefore, besides recommendation accuracy, it is critical to study high performance concurrent collaborative filtering on modern platforms. To achieve high performance, we study the graph data locality characteristics of collaborative filtering. Our experiments demonstrate that although an individual graph traversal exhibits poor data locality, multiple queries have a tendency of sharing their data footprints, especially in the case of queries with neighboring root vertices. Such characteristics lead to both inter- and intra-thread data locality, which can be utilized to significantly improve collaborative filtering performance. Based on these observations, we present a cache-conscious system for collaborative filtering on modern multi-socket multicore platforms. In this system, we propose a cache-conscious query scheduling technique and an in-memory graph representation, and to maximize cache performance and minimize cross-core/socket communication overhead, we address both inter- and intra-thread data locality. To address the workload balancing issue, this study introduces a dynamic work-stealing mechanism to explore the tradeoff between workload balancing and cache-consciousness. The proposed system was evaluated on a Power7+ system against the IBM Knowledge Repository graph dataset. The results demonstrated both good scalability and throughput. Compared with the basic system that does not perform cache-conscious scheduling, inter-thread scheduling improves throughput by up to 18%. Intra-thread scheduling can further improve throughput by as much as 22%. By enabling dynamic work-stealing, the proposed technique balances workloads across all threads with a low standard deviation of the per-thread processing time.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 11th ACM Conference on Computing Frontiers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2597917.2597935","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 4

Abstract

Recommendation systems using graph collaborative filtering often require real-time responses and high throughput. Therefore, besides recommendation accuracy, it is critical to study high-performance concurrent collaborative filtering on modern platforms. To achieve high performance, we study the graph data locality characteristics of collaborative filtering. Our experiments demonstrate that although an individual graph traversal exhibits poor data locality, multiple queries tend to share their data footprints, especially queries with neighboring root vertices. These characteristics yield both inter- and intra-thread data locality, which can be exploited to significantly improve collaborative filtering performance. Based on these observations, we present a cache-conscious system for collaborative filtering on modern multi-socket multicore platforms. The system incorporates a cache-conscious query scheduling technique and an in-memory graph representation that address both inter- and intra-thread data locality, maximizing cache performance and minimizing cross-core/socket communication overhead. To balance workloads, we introduce a dynamic work-stealing mechanism that explores the tradeoff between load balancing and cache-consciousness. The proposed system was evaluated on a Power7+ system using the IBM Knowledge Repository graph dataset, and the results demonstrate good scalability and throughput. Compared with a baseline system without cache-conscious scheduling, inter-thread scheduling improves throughput by up to 18%, and intra-thread scheduling further improves throughput by as much as 22%. With dynamic work-stealing enabled, the proposed technique balances workloads across all threads with a low standard deviation of the per-thread processing time.
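To make the two core ideas of the abstract concrete, the following is a minimal C++ sketch of cache-conscious query scheduling combined with locality-aware work-stealing. It is not the authors' implementation: all names (`Query`, `WorkerQueue`, `schedule_queries`, `try_steal`) and the specific policies (sorting queries by root vertex ID as a proxy for footprint overlap, preferring same-socket victims when stealing, coarse per-queue locks) are illustrative assumptions.

```cpp
// Sketch: group queries with neighboring root vertices into the same per-thread
// batch (inter-thread locality), and let idle threads steal work from same-socket
// victims first (balancing load while limiting cross-socket traffic).
#include <algorithm>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

struct Query {
    uint32_t root;  // root vertex of the graph traversal for this query
};

struct WorkerQueue {
    std::deque<Query> tasks;
    std::mutex lock;  // coarse lock for clarity; a real system would use a lock-free deque
};

// Inter-thread scheduling: sort queries by root vertex so that queries with
// neighboring roots (and hence overlapping data footprints) land in the same
// per-thread batch, improving reuse in the caches shared by a socket.
void schedule_queries(std::vector<Query>& queries,
                      std::vector<WorkerQueue>& workers) {
    if (queries.empty() || workers.empty()) return;
    std::sort(queries.begin(), queries.end(),
              [](const Query& a, const Query& b) { return a.root < b.root; });
    const size_t per_worker =
        (queries.size() + workers.size() - 1) / workers.size();
    for (size_t i = 0; i < queries.size(); ++i) {
        WorkerQueue& w = workers[std::min(i / per_worker, workers.size() - 1)];
        std::lock_guard<std::mutex> g(w.lock);
        w.tasks.push_back(queries[i]);
    }
}

// Dynamic work-stealing: an idle worker first tries victims on its own socket
// (preserving cache locality), then falls back to victims on remote sockets.
bool try_steal(size_t self, size_t threads_per_socket,
               std::vector<WorkerQueue>& workers, Query& out) {
    auto attempt = [&](size_t victim) {
        if (victim == self) return false;
        std::lock_guard<std::mutex> g(workers[victim].lock);
        if (workers[victim].tasks.empty()) return false;
        out = workers[victim].tasks.back();  // steal from the end opposite the owner
        workers[victim].tasks.pop_back();
        return true;
    };
    const size_t socket = self / threads_per_socket;
    const size_t local_begin = socket * threads_per_socket;
    const size_t local_end =
        std::min(local_begin + threads_per_socket, workers.size());
    for (size_t v = local_begin; v < local_end; ++v)   // same-socket victims first
        if (attempt(v)) return true;
    for (size_t v = 0; v < workers.size(); ++v)        // then any remaining victim
        if (attempt(v)) return true;
    return false;
}
```

In this sketch, owners consume their batch from the front while thieves steal from the back, so a stolen query is the one whose root vertex is farthest from the victim's current working set, which keeps the victim's cached footprint mostly intact; this is one plausible way to realize the balancing-versus-locality tradeoff the paper describes.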