Fast Query Processing by Distributing an Index over CPU Caches

Xiaoqin Ma, G. Cooperman
{"title":"Fast Query Processing by Distributing an Index over CPU Caches","authors":"Xiaoqin Ma, G. Cooperman","doi":"10.1109/CLUSTR.2005.347047","DOIUrl":null,"url":null,"abstract":"Data intensive applications on clusters often require requests quickly be sent to the node managing the desired data. In many applications, one must look through a sorted tree structure to determine the responsible node for accessing or storing the data. Examples include object tracking in sensor networks, packet routing over the Internet, request processing in publish-subscribe middleware, and query processing in database systems. When the tree structure is larger than the CPU cache, the standard implementation potentially incurs many cache misses for each lookup; one cache miss at each successive level of the tree. As the CPU-RAM gap grows, this performance degradation will only become worse in the future. We propose a solution that takes advantage of the growing speed of local area networks for clusters. We split the sorted tree structure among the nodes of the cluster. We assume that the structure will fit inside the aggregation of the CPU caches of the entire cluster. We then send a word over the network (as part of a larger packet containing other words) in order to examine the tree structure in another node's CPU cache. We show that this is often faster than the standard solution, which locally incurs multiple cache misses while accessing each successive level of the tree. The principle is demonstrated with a cluster configured with Pentium III nodes connected with a Myrinet network. The new approach is shown to be 50% faster on this current cluster. In the future, the new approach is expected to have a still greater advantage as networks grow in speed, and as cache lines grow in length (greater cache miss penalty). This can be used to successfully overcome the inherent memory latency associated with cache misses","PeriodicalId":255312,"journal":{"name":"2005 IEEE International Conference on Cluster Computing","volume":"265 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2004-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2005 IEEE International Conference on Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLUSTR.2005.347047","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Data-intensive applications on clusters often require that requests be sent quickly to the node managing the desired data. In many applications, one must search a sorted tree structure to determine the node responsible for accessing or storing the data. Examples include object tracking in sensor networks, packet routing over the Internet, request processing in publish-subscribe middleware, and query processing in database systems. When the tree structure is larger than the CPU cache, the standard implementation potentially incurs many cache misses per lookup: one at each successive level of the tree. As the CPU-RAM speed gap grows, this performance degradation will only worsen. We propose a solution that takes advantage of the growing speed of local area networks for clusters. We split the sorted tree structure among the nodes of the cluster, assuming that the structure fits within the aggregate CPU cache of the entire cluster. We then send a word over the network (as part of a larger packet containing other words) in order to examine the portion of the tree structure held in another node's CPU cache. We show that this is often faster than the standard solution, which locally incurs multiple cache misses while accessing each successive level of the tree. The principle is demonstrated on a cluster of Pentium III nodes connected by a Myrinet network, where the new approach is shown to be 50% faster. The advantage is expected to grow as networks become faster and as cache lines grow in length (increasing the cache miss penalty). This approach can thus overcome the memory latency inherent in cache misses.
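To make the lookup path concrete, here is a minimal C++ sketch of the idea under stated assumptions: the sorted index is range-partitioned so that each node's slice fits in its CPU cache, and a lookup is routed to the owning node, which resolves it with a cache-resident binary search. This is an illustrative single-process simulation, not the authors' code; all names (`IndexNode`, `buildCluster`, `lookup`) are hypothetical, and the "network send" is modeled as a local function call, whereas the real system batches many keys per Myrinet packet.

```cpp
// Sketch of a range-partitioned sorted index distributed across cluster nodes.
// Illustrative simulation only: the routing step stands in for a network send,
// and the real system would batch many lookup keys into one packet.
#include <algorithm>
#include <cstdio>
#include <vector>

struct IndexNode {
    int low;                      // smallest key this node is responsible for
    std::vector<int> keys;        // sorted slice, assumed to fit in CPU cache
    std::vector<int> dataOwner;   // dataOwner[i]: node storing keys[i]'s record
};

// Build a cluster of n index nodes by range-partitioning a sorted key set.
std::vector<IndexNode> buildCluster(const std::vector<int>& sortedKeys, int n) {
    std::vector<IndexNode> cluster(n);
    size_t per = (sortedKeys.size() + n - 1) / n;
    for (int i = 0; i < n; ++i) {
        size_t begin = i * per;
        size_t end = std::min(begin + per, sortedKeys.size());
        cluster[i].low = sortedKeys[begin];
        for (size_t j = begin; j < end; ++j) {
            cluster[i].keys.push_back(sortedKeys[j]);
            cluster[i].dataOwner.push_back(static_cast<int>(j % n)); // toy placement
        }
    }
    return cluster;
}

// Route a lookup key to the index node owning its range ("one word on the
// wire"), then resolve it there with a cache-resident binary search.
int lookup(const std::vector<IndexNode>& cluster, int key) {
    // Small routing table: pick the last node whose range starts at or below key.
    int owner = 0;
    for (size_t i = 1; i < cluster.size() && cluster[i].low <= key; ++i)
        owner = static_cast<int>(i);
    const IndexNode& n = cluster[owner];
    auto it = std::lower_bound(n.keys.begin(), n.keys.end(), key);
    if (it == n.keys.end() || *it != key) return -1;   // key not in the index
    return n.dataOwner[it - n.keys.begin()];           // node managing the data
}

int main() {
    std::vector<int> keys;
    for (int k = 0; k < 1000; k += 2) keys.push_back(k);   // sorted even keys
    auto cluster = buildCluster(keys, 4);
    std::printf("key 250 -> data node %d\n", lookup(cluster, 250));
    std::printf("key 251 -> data node %d\n", lookup(cluster, 251)); // miss: -1
    return 0;
}
```

On a real cluster the body of `lookup` would run on the remote node, with the key shipped in a packet alongside other pending lookups. The paper's claim is that this one round trip over a fast LAN beats the several DRAM-latency cache misses of walking an out-of-cache tree locally.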