{"title":"Automated OS-level Device Runtime Power Management","authors":"Chao Xu, F. Lin, Yuyang Wang, Lin Zhong","doi":"10.1145/2694344.2694360","DOIUrl":"https://doi.org/10.1145/2694344.2694360","url":null,"abstract":"Non-CPU devices on a modern system-on-a-chip (SoC), ranging from accelerators to I/O controllers, account for a significant portion of the chip area. It is therefore vital for system energy efficiency that idle devices can enter a low-power state while still meeting the performance expectation. This is called device runtime Power Management (PM) for which individual device drivers in commodity OSes are held responsible today. Based on the observations of existing drivers and their evolution, we consider it harmful to rely on drivers for device runtime PM. This paper identifies three pieces of information as essential to device runtime PM, and shows that they can be obtained without involving drivers, either by using a software-only approach, or more efficiently, by adding one register bit to each device. We thus suggest a structural change to the current Linux runtime PM framework, replacing the PM code in all applicable drivers with a single kernel module called the central PM agent. Experimental evaluations show that the central PM agent is just as effective as hand-tuned driver PM code. The paper also presents a tool called PowerAdvisor that simplifies driver PM efforts under the current Linux runtime PM framework. PowerAdvisor analyzes execution traces and suggests where to insert PM calls in driver source code. Despite being a best-effort tool, PowerAdvisor not only reproduces hand-tuned PM code from stock drivers, but also correctly suggests PM code never known before. Overall, our experience shows that it is promising to ultimately free driver developers from manual PM.","PeriodicalId":403247,"journal":{"name":"Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116210994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mojim: A Reliable and Highly-Available Non-Volatile Memory System","authors":"Yiying Zhang, Jian Yang, Amirsaman Memaripour, S. Swanson","doi":"10.1145/2694344.2694370","DOIUrl":"https://doi.org/10.1145/2694344.2694370","url":null,"abstract":"Next-generation non-volatile memories (NVMs) promise DRAM-like performance, persistence, and high density. They can attach directly to processors to form non-volatile main memory (NVMM) and offer the opportunity to build very low-latency storage systems. These high-performance storage systems would be especially useful in large-scale data center environments where reliability and availability are critical. However, providing reliability and availability to NVMM is challenging, since the latency of data replication can overwhelm the low latency that NVMM should provide. We propose Mojim, a system that provides the reliability and availability that large-scale storage systems require, while preserving the performance of NVMM. Mojim achieves these goals by using a two-tier architecture in which the primary tier contains a mirrored pair of nodes and the secondary tier contains one or more secondary backup nodes with weakly consistent copies of data. Mojim uses highly-optimized replication protocols, software, and networking stacks to minimize replication costs and expose as much of NVMM?s performance as possible. We evaluate Mojim using raw DRAM as a proxy for NVMM and using an industrial NVMM emulation system. We find that Mojim provides replicated NVMM with similar or even better performance than un-replicated NVMM (reducing latency by 27% to 63% and delivering between 0.4 to 2.7X the throughput). We demonstrate that replacing MongoDB's built-in replication system with Mojim improves MongoDB's performance by 3.4 to 4X.","PeriodicalId":403247,"journal":{"name":"Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133995591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 7B: GPUs","authors":"O. Ozturk","doi":"10.1145/3251036","DOIUrl":"https://doi.org/10.1145/3251036","url":null,"abstract":"","PeriodicalId":403247,"journal":{"name":"Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114148099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 4A: Energy","authors":"Arrvindh Shriraman","doi":"10.1145/3251032","DOIUrl":"https://doi.org/10.1145/3251032","url":null,"abstract":"","PeriodicalId":403247,"journal":{"name":"Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114694065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 2B: Warehouse Scale Computing I","authors":"Lingjia Tang","doi":"10.1145/3251029","DOIUrl":"https://doi.org/10.1145/3251029","url":null,"abstract":"","PeriodicalId":403247,"journal":{"name":"Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114161708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compiler Management of Communication and Parallelism for Quantum Computation","authors":"Jeff Heckey, S. Patil, Ali JavadiAbhari, Adam Holmes, Daniel Kudrow, K. Brown, D. Franklin, F. Chong, M. Martonosi","doi":"10.1145/2694344.2694357","DOIUrl":"https://doi.org/10.1145/2694344.2694357","url":null,"abstract":"Quantum computing (QC) offers huge promise to accelerate a range of computationally intensive benchmarks. Quantum computing is limited, however, by the challenges of decoherence: i.e., a quantum state can only be maintained for short windows of time before it decoheres. While quantum error correction codes can protect against decoherence, fast execution time is the best defense against decoherence, so efficient architectures and effective scheduling algorithms are necessary. This paper proposes the Multi-SIMD QC architecture and then proposes and evaluates effective schedulers to map benchmark descriptions onto Multi-SIMD architectures. The Multi-SIMD model consists of a small number of SIMD regions, each of which may support operations on up to thousands of qubits per cycle. Efficient Multi-SIMD operation requires efficient scheduling. This work develops schedulers to reduce communication requirements of qubits between operating regions, while also improving parallelism.We find that communication to global memory is a dominant cost in QC. We also note that many quantum benchmarks have long serial operation paths (although each operation may be data parallel). To exploit this characteristic, we introduce Longest-Path-First Scheduling (LPFS) which pins operations to SIMD regions to keep data in-place and reduce communication to memory. The use of small, local scratchpad memories also further reduces communication. Our results show a 3% to 308% improvement for LPFS over conventional scheduling algorithms, and an additional 3% to 64% improvement using scratchpad memories. Our work is the most comprehensive software-to-quantum toolflow published to date, with efficient and practical scheduling techniques that reduce communication and increase parallelism for full-scale quantum code executing up to a trillion quantum gate operations.","PeriodicalId":403247,"journal":{"name":"Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115449220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NumaGiC: a Garbage Collector for Big Data on Big NUMA Machines","authors":"Lokesh Gidra, Gaël Thomas, Julien Sopena, M. Shapiro, Nhan Nguyen","doi":"10.1145/2694344.2694361","DOIUrl":"https://doi.org/10.1145/2694344.2694361","url":null,"abstract":"On contemporary cache-coherent Non-Uniform Memory Access (ccNUMA) architectures, applications with a large memory footprint suffer from the cost of the garbage collector (GC), because, as the GC scans the reference graph, it makes many remote memory accesses, saturating the interconnect between memory nodes. We address this problem with NumaGiC, a GC with a mostly-distributed design. In order to maximise memory access locality during collection, a GC thread avoids accessing a different memory node, instead notifying a remote GC thread with a message; nonetheless, NumaGiC avoids the drawbacks of a pure distributed design, which tends to decrease parallelism. We compare NumaGiC with Parallel Scavenge and NAPS on two different ccNUMA architectures running on the Hotspot Java Virtual Machine of OpenJDK 7. On Spark and Neo4j, two industry-strength analytics applications, with heap sizes ranging from 160GB to 350GB, and on SPECjbb2013 and SPECjbb2005, ourgc improves overall performance by up to 45% over NAPS (up to 94% over Parallel Scavenge), and increases the performance of the collector itself by up to 3.6x over NAPS (up to 5.4x over Parallel Scavenge).","PeriodicalId":403247,"journal":{"name":"Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124604881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SD-PCM: Constructing Reliable Super Dense Phase Change Memory under Write Disturbance","authors":"Rujia Wang, Lei Jiang, Youtao Zhang, Jun Yang","doi":"10.1145/2694344.2694352","DOIUrl":"https://doi.org/10.1145/2694344.2694352","url":null,"abstract":"Phase Change Memory (PCM) has better scalability and smaller cell size comparing to DRAM. However, further scaling PCM cell in deep sub-micron regime results in significant thermal based write disturbance (WD). Naively allocating large inter-cell space increases cell size from 4F2 ideal to 12F2. While a recent work mitigates WD along word-lines through disturbance resilient data encoding, it is ineffective for WD along bit-lines, which is more severe due to widely adopted $mu$Trench structure in constructing PCM cell arrays. Without mitigating WD along bit-lines, a PCM cell still has 8F2, which is 100% larger than the ideal. In this paper, we propose SD-PCM for achieving reliable write operations in super dense PCM. In particular, we focus on mitigating WD along bit-lines such that we can construct super dense PCM chips with 4F2 cell size, i.e., the minimal for diode-switch based PCM. Based on simple verification-n-correction (VnC), we propose LazyCorrection and PreRead to effectively reduce VnC overhead and minimize cascading verification during write. We further propose (n:m)-Alloc for achieving good tradeoff between VnC overhead minimization and memory capacity loss. Our experimental results show that, comparing to a WD-free low density PCM, SD-PCM achieves 80% capacity improvement in cell arrays while incurring around 0-10% performance degradation when using different (n:m) allocators.","PeriodicalId":403247,"journal":{"name":"Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124608518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Asynchronized Concurrency: The Secret to Scaling Concurrent Search Data Structures","authors":"Tudor David, R. Guerraoui, Vasileios Trigonakis","doi":"10.1145/2694344.2694359","DOIUrl":"https://doi.org/10.1145/2694344.2694359","url":null,"abstract":"We introduce \"asynchronized concurrency (ASCY),\" a paradigm consisting of four complementary programming patterns. ASCY calls for the design of concurrent search data structures (CSDSs) to resemble that of their sequential counterparts. We argue that ASCY leads to implementations which are portably scalable: they scale across different types of hardware platforms, including single and multi-socket ones, for various classes of workloads, such as read-only and read-write, and according to different performance metrics, including throughput, latency, and energy. We substantiate our thesis through the most exhaustive evaluation of CSDSs to date, involving 6 platforms, 22 state-of-the-art CSDS algorithms, 10 re-engineered state-of-the-art CSDS algorithms following the ASCY patterns, and 2 new CSDS algorithms designed with ASCY in mind. We observe up to 30% improvements in throughput in the re-engineered algorithms, while our new algorithms out-perform the state-of-the-art alternatives.","PeriodicalId":403247,"journal":{"name":"Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133623042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Session 5B: Approximation","authors":"S. Blackburn","doi":"10.1145/3251038","DOIUrl":"https://doi.org/10.1145/3251038","url":null,"abstract":"","PeriodicalId":403247,"journal":{"name":"Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134304455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}