{"title":"Fairness-oriented OS Scheduling Support for Multicore Systems","authors":"Changdae Kim, Jaehyuk Huh","doi":"10.1145/2925426.2926262","DOIUrl":"https://doi.org/10.1145/2925426.2926262","url":null,"abstract":"Although traditional CPU scheduling efficiently utilizes multiple cores with equal computing capacity, the advent of multicores with diverse capabilities pose challenges to CPU scheduling. For the multi-cores with uneven computing capability, scheduling is essential to exploit the efficiency of core asymmetry, by matching each application with the best core type. However, in addition to the efficiency, an important aspect of CPU scheduling is fairness in CPU provisioning. Such uneven core capability is inherently unfair to threads and causes performance variance, as applications running on fast cores receive higher capability than applications on slow cores. Depending on co-running applications and scheduling decisions, the performance of an application may vary significantly. This study investigates the fairness problem in multi-cores with uneven capability, and explores the design space of OS schedulers supporting multiple fairness constraints. In this paper, we consider two fairness-oriented constraints, minimum fairness for the minimum guaranteed performance and uniformity for performance variation reduction. This study proposes three scheduling policies which guarantee a minimum performance bound while improving the overall throughput and reducing performance variation too. The three proposed fairness-oriented schedulers are implemented for the Linux kernel with an online application monitoring technique. Using an emulated asymmetric multi-core with frequency scaling and a real asymmetric multi-core with the big.LITTLE architecture, the paper shows that the proposed schedulers can effectively support the specified fairness while improving overall system throughput.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127361882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lynx: Using OS and Hardware Support for Fast Fine-Grained Inter-Core Communication","authors":"Konstantina Mitropoulou, Vasileios Porpodas, Xiaochun Zhang, Timothy M. Jones","doi":"10.1145/2925426.2926274","DOIUrl":"https://doi.org/10.1145/2925426.2926274","url":null,"abstract":"Designing high-performance software queues for fast intercore communication is challenging, but critical for maximising software parallelism. State-of-the-art single-producer / single-consumer queues for streaming applications contain multiple sections, requiring the producer and consumer to operate independently on different sections from each other. While these queues perform well for coarse-grained data transfers, they perform poorly in the fine-grained case. This paper proposes Lynx, a novel SP/SC queue, specifically tuned for fine-grained communication. Lynx is built from the ground up, reducing the generated code on the critical-path to just two operations per enqueue and dequeue. To achieve this it relies on existing commodity processor hardware and operating system exception handling support to deal with infrequent queue maintenance operations. Lynx outperforms the state-of-the art by up to 1.57x in total 64-bit throughput reaching a peak throughput of 15.7GB/s on a common desktop system. Real applications using Lynx get a performance improvement of up to 1.4x.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131529208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BLASX: A High Performance Level-3 BLAS Library for Heterogeneous Multi-GPU Computing","authors":"Linnan Wang, Wei Wu, Jianxiong Xiao, Yezhou Yang","doi":"10.1145/2925426.2926256","DOIUrl":"https://doi.org/10.1145/2925426.2926256","url":null,"abstract":"Basic Linear Algebra Subprograms (BLAS) are a set of low level linear algebra kernels widely adopted by applications involved with the deep learning and scientific computing. The massive and economic computing power brought forth by the emerging GPU architectures drives interest in implementation of compute-intensive level 3 BLAS on multi-GPU systems. In this paper, we investigate existing multi-GPU level 3 BLAS and present that 1) issues, such as the improper load balancing, inefficient communication, insufficient GPU stream level concurrency and data caching, impede current implementations from fully harnessing heterogeneous computing resources; 2) and the inter-GPU Peer-to-Peer(P2P) communication remains unexplored. We then present BLASX: a highly optimized multi-GPU level-3 BLAS. We adopt the concepts of algorithms-by-tiles treating a matrix tile as the basic data unit and operations on tiles as the basic task. Tasks are guided with a dynamic asynchronous runtime, which is cache and locality aware. The communication cost under BLASX becomes trivial as it perfectly overlaps communication and computation across multiple streams during asynchronous task progression. It also takes the current tile cache scheme one step further by proposing an innovative 2-level hierarchical tile cache, taking advantage of inter-GPU P2P communication. As a result, linear speedup is observable with BLASX under multi-GPU configurations; and the extensive benchmarks demonstrate that BLASX consistently outperforms the related leading industrial and academic implementations such as cuBLAS-XT, SuperMatrix, MAGMA.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129795970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}