Paul Caheny, Marc Casas, Miquel Moretó, Hervé Gloaguen, Maxime Saintes, E. Ayguadé, Jesús Labarta, M. Valero
{"title":"Reducing cache coherence traffic with hierarchical directory cache and NUMA-aware runtime scheduling","authors":"Paul Caheny, Marc Casas, Miquel Moretó, Hervé Gloaguen, Maxime Saintes, E. Ayguadé, Jesús Labarta, M. Valero","doi":"10.1145/2967938.2967962","DOIUrl":"https://doi.org/10.1145/2967938.2967962","url":null,"abstract":"Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the benefits they provide for scaling core count and memory capacity. Also, the flat memory address space they offer considerably improves programmability. However, ccNUMA architectures require sophisticated and expensive cache coherence protocols to enforce correctness during parallel executions, which trigger a significant amount of on- and off-chip traffic in the system. This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform through the use of a joint hardware/software approach. For several benchmarks, we study coherence traffic in detail under the influence of an added hierarchical cache layer in the directory protocol combined with runtime-managed NUMA-aware scheduling and data allocation techniques to make the most efficient use of the added hardware. The effectiveness of this joint approach is demonstrated by speedups of 1.23× to 2.54× and coherence traffic reductions between 44% and 77% in comparison to NUMA-oblivious scheduling and data allocation. 
Furthermore, we show that the NUMA-aware techniques we employ at the runtime level are crucial to ensure the added hierarchical layer in the directory coherence protocol does not introduce significant coherence traffic to the system.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122882233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sudarsun Kannan, Moinuddin K. Qureshi, Ada Gavrilovska, K. Schwan
{"title":"Energy aware persistence: Reducing energy overheads of memory-based persistence in NVMs","authors":"Sudarsun Kannan, Moinuddin K. Qureshi, Ada Gavrilovska, K. Schwan","doi":"10.1145/2967938.2967953","DOIUrl":"https://doi.org/10.1145/2967938.2967953","url":null,"abstract":"Next generation byte addressable nonvolatile memories (NVMs) such as PCM, Memristor, and 3D X-Point are attractive solutions for mobile and other end-user devices, as they offer memory scalability as well as fast persistent storage. However, NVM's limitations of slow writes and high write energy are magnified for applications that require atomic, consistent, isolated and durable (ACID) persistence. For maintaining ACID persistence guarantees, applications not only need to do extra writes to NVM but also need to execute a significant number of additional CPU instructions for performing NVM writes in a transactional manner. Our analysis shows that maintaining persistence with ACID guarantees increases CPU energy up to 7.3× and NVM energy up to 5.1× compared to a baseline with no ACID guarantees. For computing platforms such as mobile devices, where energy consumption is a critical factor, it is important that the energy cost of persistence is reduced. To address the energy overheads of persistence with ACID guarantees, we develop novel energy-aware persistence (EAP) principles that identify data durability (logging) as the dominant factor in energy increase. Next, for low energy states, we formulate energy efficient durability techniques that include a mechanism to switch between performance and energy efficient logging modes, support for NVM group commit, and a memory management method that reduces energy by trading capacity via less frequent garbage collection. For critical energy states, we propose a relaxed durability mechanism - ACI-RD - that relaxes data logging without affecting the correctness of an application. 
Finally, we evaluate EAP's principles with real applications and benchmarks. Our experimental results demonstrate up to 2× reduction in CPU and 2.4× reduction in NVM energy usage compared to the traditional ACID persistence.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114936489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
R. Boyapati, Jiayi Huang, Ningyuan Wang, Kyung Hoon Kim, K. H. Yum, Eun Jung Kim
{"title":"POSTER: Fly-Over: A light-weight distributed power-gating mechanism for energy-efficient networks-on-chip","authors":"R. Boyapati, Jiayi Huang, Ningyuan Wang, Kyung Hoon Kim, K. H. Yum, Eun Jung Kim","doi":"10.1145/2967938.2974058","DOIUrl":"https://doi.org/10.1145/2967938.2974058","url":null,"abstract":"Reducing static NoC power consumption is becoming critical for energy-efficient computing as technology scales down, since NoCs are devouring a large fraction of the on-chip power budget. We propose Fly-Over (FLOV), a light-weight distributed mechanism for power-gating routers. With simple modifications to the baseline router architecture, FLOV links are facilitated over power-gated routers. A handshake protocol that allows seamless router power-gating, together with a dynamic routing algorithm that provides best-effort minimal paths without requiring global network information, maintains normal NoC functionality. We evaluate our schemes using synthetic workloads as well as real workloads from the PARSEC 2.1 benchmark suite. The results show that FLOV can achieve on average 19.2% latency reduction and 15.9% total energy savings.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134423356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rinnegan: Efficient resource use in heterogeneous architectures","authors":"S. Panneerselvam, M. Swift","doi":"10.1145/2967938.2967964","DOIUrl":"https://doi.org/10.1145/2967938.2967964","url":null,"abstract":"Current processors provide a variety of different processing units to improve performance and power efficiency. For example, ARM's big.LITTLE, AMD's APUs, and Oracle's M7 provide heterogeneous processors, on-die GPUs, and on-die accelerators. However, the performance experienced by programs using these processing units can vary widely due to contention from multiprogramming, thermal constraints and other issues. In these systems, the decision of where to execute a task must consider not only execution time of the task, but also current system conditions. We built Rinnegan, a Linux kernel extension and runtime library, to perform scheduling and handle task placement in heterogeneous systems. The Rinnegan kernel extension monitors and reports the utilization of all processing units to applications, which then make placement decisions at user level. The Rinnegan runtime provides a performance model to predict the speedup and overhead of offloading a task. With this model and the current utilization of processing units, the runtime can select the task placement that best achieves an application's performance goals, such as low latency, high throughput, or real-time deadlines. 
When integrated with StarPU, a runtime system for heterogeneous architectures, Rinnegan improves StarPU by performing 1.5-2× better than its native scheduling policies in a shared heterogeneous environment.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132967286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yipeng Wang, Ren Wang, Andrew J. Herdrich, James Tsai, Yan Solihin
{"title":"CAF: Core to core Communication Acceleration Framework","authors":"Yipeng Wang, Ren Wang, Andrew J. Herdrich, James Tsai, Yan Solihin","doi":"10.1145/2967938.2967954","DOIUrl":"https://doi.org/10.1145/2967938.2967954","url":null,"abstract":"As the number of cores in a multicore system increases, core-to-core (C2C) communication is increasingly limiting the performance scaling of workloads that share data frequently. The traditional way cores communicate is by using shared memory space between them. However, shared memory communication fundamentally involves coherence invalidations and cache misses, which cause large performance overheads and incur a high amount of network traffic. Many important workloads incur significant C2C communication and are affected significantly by the costs, including pipelined packet processing which is widely used in software-based networking solutions. In these workloads, threads run on different cores and pass packets from one core to another for different stages of processing using software queues. In this paper, we analyze the behavior and overheads of software queue management. Based on this analysis, we propose a novel C2C Communication Acceleration Framework (CAF) to optimize C2C communication. CAF offloads substantial communication burdens from cores and memory to a designated, efficient hardware device we refer to as Queue Management Device (QMD) attached to the Network on Chip. 
CAF combines hardware and software optimizations to effectively reduce the queue-induced communication overheads and improve the overall system performance by 2-12× over traditional software queue implementations.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126922894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mingcong Song, Yang Hu, Yunlong Xu, Chao Li, Huixiang Chen, Jingling Yuan, Tao Li
{"title":"Bridging the semantic gaps of GPU acceleration for scale-out CNN-based big data processing: Think big, see small","authors":"Mingcong Song, Yang Hu, Yunlong Xu, Chao Li, Huixiang Chen, Jingling Yuan, Tao Li","doi":"10.1145/2967938.2967944","DOIUrl":"https://doi.org/10.1145/2967938.2967944","url":null,"abstract":"Convolutional Neural Networks (CNNs) have substantially advanced the state-of-the-art accuracies of object recognition, which is the core function of a myriad of modern multimedia processing techniques such as image/video processing, speech recognition, and natural language processing. GPU-based accelerators have gained increasing attention because the large number of highly parallel neurons in a CNN naturally matches the GPU computation pattern. In this work, we perform comprehensive experiments to investigate the performance bottlenecks and overheads of the current GPU acceleration platform for scale-out CNN-based big data processing. In our characterization, we observe two significant semantic gaps: the framework gap that lies between the CNN-based data processing workflow and the data processing manner of the distributed framework; and the standalone gap that lies between the uneven computation loads at different CNN layers and the fixed computing capacity provisioning of the current GPU acceleration library. To bridge these gaps, we propose D3NN, a Distributed, Decoupled, and Dynamically tuned GPU acceleration framework for modern CNN architectures. In particular, D3NN features a novel analytical model that enables accurate time estimation of GPU accelerated CNN processing with only 5-10% error. Our evaluation results show that a standalone processing node using D3NN achieves up to 3.7× higher throughput than the current standalone GPU acceleration platform. 
Our CNN-oriented GPU acceleration library with built-in dynamic batching scheme achieves up to 1.5× performance improvement over the non-batching scheme and outperforms the state-of-the-art deep learning library by up to 28% in performance mode and 67% in memory-efficient mode.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"133 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116578432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Greater performance and better efficiency: Predicated execution has shown us the way","authors":"Y. Patt","doi":"10.1145/2967938.2970376","DOIUrl":"https://doi.org/10.1145/2967938.2970376","url":null,"abstract":"We have been riding a strong wave of greater and greater performance for decades, to some extent due to the combination of Moore's Law and Dennard scaling. But we are told all this is coming to an end, in part because we cannot continue to double the transistor count on the chip and we cannot run these things at higher and higher frequencies. Much of the silliness promised by multicore is just that, and not the answer. So, what are we to do? It turns out predication gave us the answer more than 30 years ago. Most of us were not paying attention. Today we have no choice. Predication happened because the compiler, the ISA, and the microarchitecture all cooperated so it could happen. That meant breaking the artificial walls in the transformation hierarchy. If we accept this as something we have to do, there are plenty of opportunities (a) for increased performance (attacking latency instead of just multicore bandwidth) and (b) for better energy efficiency. In this talk I hope to point out some of them, and then ask the obvious question: What do we need to do to make this happen?","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124059406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Student research poster: Software out-of-order execution for in-order architectures","authors":"Kim-Anh Tran","doi":"10.1145/2967938.2971466","DOIUrl":"https://doi.org/10.1145/2967938.2971466","url":null,"abstract":"Processor cores are divided into two categories: fast and power-hungry out-of-order processors, and efficient, but slower in-order processors. To achieve high performance with low-energy budgets, this proposal aims to deliver out-of-order processing by software (SWOOP) on in-order architectures. Problem: A primary cause for slowdown in in-order processors is last-level cache misses (caused by difficult-to-predict data-dependent loads), resulting in cores stalling. Solution: As loads are non-blocking operations, independent instructions are scheduled to run before the loads return. We execute critical load instructions earlier in the program for a three-fold benefit: increasing memory and instruction level parallelism, and hiding memory latency. Related work: Some instruction scheduling policies attempt to hide memory latency, but scheduling is confined by basic block limits and register pressure. Software pipelining [3] is restricted by dependencies between instructions and decoupled access-execute (DAE) [1] suffers from address re-computation. 
Unlike EPIC [2] (evolved from VLIW), SWOOP does not require hardware support for predicated execution, speculative loads and their verification, delayed exception handling, memory disambiguation etc.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129015807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Florian Haas, Sebastian Weis, T. Ungerer, Gilles A. Pokam, Youfeng Wu
{"title":"POSTER: Fault-tolerant execution on COTS multi-core processors with hardware transactional memory support","authors":"Florian Haas, Sebastian Weis, T. Ungerer, Gilles A. Pokam, Youfeng Wu","doi":"10.1145/2967938.2974051","DOIUrl":"https://doi.org/10.1145/2967938.2974051","url":null,"abstract":"Software-based fault-tolerance mechanisms can increase the reliability of multi-core CPUs while being cheaper and more flexible than hardware solutions like lockstep architectures. However, checkpoint creation, error detection and correction entail high performance overhead if implemented in software. We propose a software/hardware hybrid approach, which leverages Intel's hardware transactional memory (TSX) to support implicit checkpoint creation and fast rollback. Hardware enhancements are proposed and evaluated, resulting in a performance overhead of 19% on average.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123730393","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EXCITE-VM: Extending the virtual memory system to support snapshot isolation transactions","authors":"Heiner Litz, Benjamin Braun, D. Cheriton","doi":"10.1145/2967938.2967955","DOIUrl":"https://doi.org/10.1145/2967938.2967955","url":null,"abstract":"Multi-core programming remains a major software development and maintenance challenge because of data races, deadlock, non-deterministic failures and complex performance issues. In this paper, we describe EXCITE-VM, a system that provides snapshot isolation transactions on shared memory to facilitate programming and to improve the performance of parallel applications. With snapshots, an application thread is not exposed to the committed changes of other threads until it receives the updates by explicitly creating a new snapshot. Snapshot isolation enables low overhead lockless read operations and improves fault tolerance by isolating each thread from the transient, uncommitted writes of other threads. This paper describes how EXCITE-VM implements snapshot isolation transactions efficiently by manipulating virtual memory mappings and using a novel copy-on-read mechanism with a customized page cache. Compared to conventional software transactional memory systems, EXCITE-VM provides up to 2.2× performance improvement for the STAMP benchmark suite and up to 1000× speedup for a modified benchmark having long running read-only transactions. Furthermore, EXCITE-VM achieves a 2× performance improvement on a Memcached benchmark and the Yahoo! Cloud Serving Benchmark. 
Finally, EXCITE-VM improves fault tolerance and offers features such as low-overhead concurrent audit and analysis.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"608 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122901598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}