网络搜索的内存层次结构

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA) Pub Date : 2018-02-01 DOI:10.1109/HPCA.2018.00061

Grant Ayers, Jung Ho Ahn, C. Kozyrakis, Parthasarathy Ranganathan

{"title":"网络搜索的内存层次结构","authors":"Grant Ayers, Jung Ho Ahn, C. Kozyrakis, Parthasarathy Ranganathan","doi":"10.1109/HPCA.2018.00061","DOIUrl":null,"url":null,"abstract":"Online data-intensive services, such as search, serve billions of users, utilize millions of cores, and comprise a significant and growing portion of datacenter-scale workloads. However, the complexity of these workloads and their proprietary nature has precluded detailed architectural evaluations and optimizations of processor design trade-offs. We present the first detailed study of the memory hierarchy for the largest commercial search engine today. We use a combination of measurements from longitudinal studies across tens of thousands of deployed servers, systematic microarchitectural evaluation on individual platforms, validated trace-driven simulation, and performance modeling – all driven by production workloads servicing real-world user requests. Our data quantifies significant differences between production search and benchmarks commonly used in the architecture community. We identify the memory hierarchy as an important opportunity for performance optimization, and present new insights pertaining to how search stresses the cache hierarchy, both for instructions and data. We show that, contrary to conventional wisdom, there is significant reuse of data that is not captured by current cache hierarchies, and discuss why this precludes state-of-the-art tiled and scale-out architectures. Based on these insights, we rethink a new cache hierarchy optimized for search that trades off the inefficient use of L3 cache transistors for higher-performance cores, and adds a latency-optimized on-package eDRAM L4 cache. Compared to state-of-the-art processors, our proposed design performs 27% to 38% better.","PeriodicalId":154694,"journal":{"name":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"69","resultStr":"{\"title\":\"Memory Hierarchy for Web Search\",\"authors\":\"Grant Ayers, Jung Ho Ahn, C. Kozyrakis, Parthasarathy Ranganathan\",\"doi\":\"10.1109/HPCA.2018.00061\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Online data-intensive services, such as search, serve billions of users, utilize millions of cores, and comprise a significant and growing portion of datacenter-scale workloads. However, the complexity of these workloads and their proprietary nature has precluded detailed architectural evaluations and optimizations of processor design trade-offs. We present the first detailed study of the memory hierarchy for the largest commercial search engine today. We use a combination of measurements from longitudinal studies across tens of thousands of deployed servers, systematic microarchitectural evaluation on individual platforms, validated trace-driven simulation, and performance modeling – all driven by production workloads servicing real-world user requests. Our data quantifies significant differences between production search and benchmarks commonly used in the architecture community. We identify the memory hierarchy as an important opportunity for performance optimization, and present new insights pertaining to how search stresses the cache hierarchy, both for instructions and data. We show that, contrary to conventional wisdom, there is significant reuse of data that is not captured by current cache hierarchies, and discuss why this precludes state-of-the-art tiled and scale-out architectures. Based on these insights, we rethink a new cache hierarchy optimized for search that trades off the inefficient use of L3 cache transistors for higher-performance cores, and adds a latency-optimized on-package eDRAM L4 cache. Compared to state-of-the-art processors, our proposed design performs 27% to 38% better.\",\"PeriodicalId\":154694,\"journal\":{\"name\":\"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)\",\"volume\":\"42 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-02-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"69\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPCA.2018.00061\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCA.2018.00061","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 69

摘要

在线数据密集型服务(如搜索)为数十亿用户提供服务，利用数百万个核心，并在数据中心规模的工作负载中占据重要的且不断增长的部分。然而，这些工作负载的复杂性及其专有性质妨碍了对处理器设计权衡进行详细的体系结构评估和优化。我们首次详细研究了当今最大的商业搜索引擎的内存层次结构。我们使用了来自成千上万台部署服务器的纵向研究、单个平台上的系统微架构评估、经过验证的跟踪驱动仿真和性能建模的测量组合——所有这些都是由服务于真实用户请求的生产工作负载驱动的。我们的数据量化了产品搜索和架构社区中常用的基准测试之间的显著差异。我们认为内存层次结构是性能优化的一个重要机会，并提出了关于搜索如何强调缓存层次结构的新见解，无论是对于指令还是数据。我们表明，与传统观点相反，当前缓存层次结构无法捕获的数据存在重要的重用，并讨论了为什么这排除了最先进的平铺和向外扩展架构。基于这些见解，我们重新考虑了一种新的缓存层次结构，该结构针对搜索进行了优化，可以将L3缓存晶体管的低效使用与高性能内核相交换，并添加了一个延迟优化的封装eDRAM L4缓存。与最先进的处理器相比，我们提出的设计性能提高了27%到38%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Memory Hierarchy for Web Search

Online data-intensive services, such as search, serve billions of users, utilize millions of cores, and comprise a significant and growing portion of datacenter-scale workloads. However, the complexity of these workloads and their proprietary nature has precluded detailed architectural evaluations and optimizations of processor design trade-offs. We present the first detailed study of the memory hierarchy for the largest commercial search engine today. We use a combination of measurements from longitudinal studies across tens of thousands of deployed servers, systematic microarchitectural evaluation on individual platforms, validated trace-driven simulation, and performance modeling – all driven by production workloads servicing real-world user requests. Our data quantifies significant differences between production search and benchmarks commonly used in the architecture community. We identify the memory hierarchy as an important opportunity for performance optimization, and present new insights pertaining to how search stresses the cache hierarchy, both for instructions and data. We show that, contrary to conventional wisdom, there is significant reuse of data that is not captured by current cache hierarchies, and discuss why this precludes state-of-the-art tiled and scale-out architectures. Based on these insights, we rethink a new cache hierarchy optimized for search that trades off the inefficient use of L3 cache transistors for higher-performance cores, and adds a latency-optimized on-package eDRAM L4 cache. Compared to state-of-the-art processors, our proposed design performs 27% to 38% better.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)

自引率

0.00%

发文量