ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers

HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture Pub Date : 2010-04-01 DOI:10.1109/HPCA.2010.5416658

Yoongu Kim, Dongsu Han, O. Mutlu, Mor Harchol-Balter


Citations: 429

Abstract

Modern chip multiprocessor (CMP) systems employ multiple memory controllers to control access to main memory. The scheduling algorithm employed by these memory controllers has a significant effect on system throughput, so choosing an efficient scheduling algorithm is important. The scheduling algorithm also needs to be scalable: as the number of cores increases, the number of memory controllers shared by the cores should also increase to provide sufficient bandwidth to feed the cores. Unfortunately, previous memory scheduling algorithms are inefficient with respect to system throughput and/or are designed for a single memory controller and do not scale well to multiple memory controllers, requiring significant fine-grained coordination among controllers. This paper proposes ATLAS (Adaptive per-Thread Least-Attained-Service memory scheduling), a fundamentally new memory scheduling technique that improves system throughput without requiring significant coordination among memory controllers. The key idea is to periodically order threads based on the service they have attained from the memory controllers so far, and prioritize those threads that have attained the least service over others in each period. The idea of favoring threads with least-attained-service is borrowed from the queueing theory literature, where, in the context of a single-server queue, it is known that least-attained-service optimally schedules jobs, assuming a Pareto (or any decreasing hazard rate) workload distribution. After verifying that our workloads have this characteristic, we show that our implementation of least-attained-service thread prioritization reduces the time the cores spend stalling and significantly improves system throughput. Furthermore, since the periods over which we accumulate the attained service are long, the controllers coordinate very infrequently to form the ordering of threads, thereby making ATLAS scalable to many controllers.
We evaluate ATLAS on a wide variety of multiprogrammed SPEC 2006 workloads and systems with 4–32 cores and 1–16 memory controllers, and compare its performance to five previously proposed scheduling algorithms. Averaged over 32 workloads on a 24-core system with 4 controllers, ATLAS improves instruction throughput by 10.8%, and system throughput by 8.4%, compared to PAR-BS, the best previous CMP memory scheduling algorithm. ATLAS's performance benefit increases as the number of cores increases.
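
The key idea in the abstract (accumulate per-thread attained service over a long period, then have the controllers agree on a ranking that favors the least-served threads) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Controller`, `end_of_quantum`, `history`, and the `decay` weighting of past periods are hypothetical choices made for illustration, and the abstract does not specify how service is measured or how history is weighted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Controller:
    """One memory controller (hypothetical model, not from the paper)."""
    service: dict[int, float] = field(default_factory=dict)  # this period only
    ranking: list[int] = field(default_factory=list)         # shared thread order

    def record_service(self, thread_id: int, cycles: float) -> None:
        # Accumulate locally; no coordination with other controllers here.
        self.service[thread_id] = self.service.get(thread_id, 0.0) + cycles

    def pick(self, pending: list[int]) -> int:
        # Serve the pending thread that ranks first (least attained service).
        # Assumption: a thread not yet ranked has attained no service, so it wins.
        position = {t: i for i, t in enumerate(self.ranking)}
        return min(pending, key=lambda t: position.get(t, -1))


def end_of_quantum(controllers: list[Controller],
                   history: dict[int, float],
                   decay: float = 0.5) -> list[int]:
    """Infrequent coordination step: merge per-controller service and re-rank.

    `decay` is an assumed weighting of older periods, not a value from the text.
    """
    quantum: dict[int, float] = {}
    for c in controllers:
        for t, s in c.service.items():
            quantum[t] = quantum.get(t, 0.0) + s
        c.service.clear()
    for t in set(history) | set(quantum):
        history[t] = decay * history.get(t, 0.0) + (1 - decay) * quantum.get(t, 0.0)
    # Least attained service first; ties broken by thread id for determinism.
    ranking = sorted(history, key=lambda t: (history[t], t))
    for c in controllers:
        c.ranking = list(ranking)
    return ranking


controllers = [Controller(), Controller()]
history: dict[int, float] = {}
controllers[0].record_service(0, 100)
controllers[1].record_service(0, 50)
controllers[0].record_service(1, 30)
controllers[1].record_service(2, 10)
first_ranking = end_of_quantum(controllers, history)
```

In this sketch the controllers exchange information only inside `end_of_quantum`, which mirrors the abstract's point that long accumulation periods keep coordination infrequent; between those calls each controller schedules using only its local copy of the ranking.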

