Proceedings of the 19th international conference on Architectural support for programming languages and operating systems: latest publications

Scale-out NUMA
Stanko Novakovic, Alexandros Daglis, Edouard Bugnion, B. Falsafi, Boris Grot
{"title":"Scale-out NUMA","authors":"Stanko Novakovic, Alexandros Daglis, Edouard Bugnion, B. Falsafi, Boris Grot","doi":"10.1145/2541940.2541965","DOIUrl":"https://doi.org/10.1145/2541940.2541965","url":null,"abstract":"Emerging datacenter applications operate on vast datasets that are kept in DRAM to minimize latency. The large number of servers needed to accommodate this massive memory footprint requires frequent server-to-server communication in applications such as key-value stores and graph-based applications that rely on large irregular data structures. The fine-grained nature of the accesses is a poor match to commodity networking technologies, including RDMA, which incur delays of 10-1000x over local DRAM operations. We introduce Scale-Out NUMA (soNUMA) -- an architecture, programming model, and communication protocol for low-latency, distributed in-memory processing. soNUMA layers an RDMA-inspired programming model directly on top of a NUMA memory fabric via a stateless messaging protocol. To facilitate interactions between the application, OS, and the fabric, soNUMA relies on the remote memory controller -- a new architecturally-exposed hardware block integrated into the node's local coherence hierarchy. 
Our results based on cycle-accurate full-system simulation show that soNUMA performs remote reads at latencies that are within 4x of local DRAM, can fully utilize the available memory bandwidth, and can issue up to 10M remote memory operations per second per core.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123896228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 154
Finding the limit: examining the potential and complexity of compilation scheduling for JIT-based runtime systems
Yufei Ding, Mingzhou Zhou, Zhijia Zhao, Sarah Eisenstat, Xipeng Shen
{"title":"Finding the limit: examining the potential and complexity of compilation scheduling for JIT-based runtime systems","authors":"Yufei Ding, Mingzhou Zhou, Zhijia Zhao, Sarah Eisenstat, Xipeng Shen","doi":"10.1145/2541940.2541945","DOIUrl":"https://doi.org/10.1145/2541940.2541945","url":null,"abstract":"This work aims to find out the full potential of compilation scheduling for JIT-based runtime systems. Compilation scheduling determines the order in which the compilation units (e.g., functions) in a program are to be compiled or recompiled. It decides when what versions of the units are ready to run, and hence affects performance. But it has been a largely overlooked direction in JIT-related research, with some fundamental questions left open: How significant compilation scheduling is for performance, how good the scheduling schemes employed by existing runtime systems are, and whether a great potential exists for improvement. This study proves the strong NP-completeness of the problem, proposes a heuristic algorithm that yields near optimal schedules, examines the potential of two current scheduling schemes empirically, and explores the relations with JIT designs. 
It provides the first principled understanding to the complexity and potential of compilation scheduling, shedding some insights for JIT-based runtime system improvement.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124335868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
ASC: automatically scalable computation
Amos Waterland, E. Angelino, Ryan P. Adams, J. Appavoo, M. Seltzer
{"title":"ASC: automatically scalable computation","authors":"Amos Waterland, E. Angelino, Ryan P. Adams, J. Appavoo, M. Seltzer","doi":"10.1145/2541940.2541985","DOIUrl":"https://doi.org/10.1145/2541940.2541985","url":null,"abstract":"We present an architecture designed to transparently and automatically scale the performance of sequential programs as a function of the hardware resources available. The architecture is predicated on a model of computation that views program execution as a walk through the enormous state space composed of the memory and registers of a single-threaded processor. Each instruction execution in this model moves the system from its current point in state space to a deterministic subsequent point. We can parallelize such execution by predictively partitioning the complete path and speculatively executing each partition in parallel. Accurately partitioning the path is a challenging prediction problem. We have implemented our system using a functional simulator that emulates the x86 instruction set, including a collection of state predictors and a mechanism for speculatively executing threads that explore potential states along the execution path. 
While the overhead of our simulation makes it impractical to measure speedup relative to native x86 execution, experiments on three benchmarks show scalability of up to a factor of 256 on a 1024 core machine when executing unmodified sequential programs.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123509615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
RelaxReplay: record and replay for relaxed-consistency multiprocessors
N. Honarmand, J. Torrellas
{"title":"RelaxReplay: record and replay for relaxed-consistency multiprocessors","authors":"N. Honarmand, J. Torrellas","doi":"10.1145/2541940.2541979","DOIUrl":"https://doi.org/10.1145/2541940.2541979","url":null,"abstract":"Record and Deterministic Replay (RnR) of multithreaded programs on relaxed-consistency multiprocessors has been a long-standing problem. While there are designs that work for Total Store Ordering (TSO), finding a general solution that is able to record the access reordering allowed by any relaxed-consistency model has proved challenging. This paper presents the first complete solution for hardware-assisted memory race recording that works for any relaxed-consistency model of current processors. With the scheme, called RelaxReplay, we can build an RnR system for any relaxed-consistency model and coherence protocol. RelaxReplay's core innovation is a new way of capturing memory access reordering. Each memory instruction goes through a post-completion in-order counting step that detects any reordering, and efficiently records it. We evaluate RelaxReplay with simulations of an 8-core release-consistent multicore running SPLASH-2 programs. We observe that RelaxReplay induces negligible overhead during recording. In addition, the average size of the log produced is comparable to the log sizes reported for existing solutions, and still very small compared to the memory bandwidth of modern machines. 
Finally, deterministic replay is efficient and needs minimal hardware support.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128729023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Triple-A: a Non-SSD based autonomic all-flash array for high performance storage systems
Myoungsoo Jung, Wonil Choi, J. Shalf, M. Kandemir
{"title":"Triple-A: a Non-SSD based autonomic all-flash array for high performance storage systems","authors":"Myoungsoo Jung, Wonil Choi, J. Shalf, M. Kandemir","doi":"10.1145/2541940.2541953","DOIUrl":"https://doi.org/10.1145/2541940.2541953","url":null,"abstract":"Solid State Disk (SSD) arrays are in a position to (at least partially) replace spinning disk arrays in high performance computing (HPC) systems due to their better performance and lower power consumption. However, these emerging SSD arrays are facing enormous challenges, which are not observed in disk-based arrays. Specifically, we observe that the performance of SSD arrays can significantly degrade due to various array-level resource contentions. In addition, their maintenance costs exponentially increase over time, which renders them difficult to deploy widely in HPC systems. To address these challenges, we propose Triple-A, a non-SSD based Autonomic All-Flash Array, which is a self-optimizing, from-scratch NAND flash cluster. Triple-A can detect two different types of resource contentions and autonomically alleviate them by reshaping the physical data-layout on its flash array network. 
Our experimental evaluation using both real workloads and a micro-benchmark show that Triple-A can offer a 53% higher sustained throughput and a 80% lower I/O latency than non-autonomic SSD arrays.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"57 32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129036031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs
Woo-Cheol Kwon, T. Krishna, L. Peh
{"title":"Locality-oblivious cache organization leveraging single-cycle multi-hop NoCs","authors":"Woo-Cheol Kwon, T. Krishna, L. Peh","doi":"10.1145/2541940.2541976","DOIUrl":"https://doi.org/10.1145/2541940.2541976","url":null,"abstract":"Locality has always been a critical factor in on-chip data placement on CMPs as accessing further-away caches has in the past been more costly than accessing nearby ones. Substantial research on locality-aware designs have thus focused on keeping a copy of the data private. However, this complicates the problem of data tracking and search/invalidation; tracking the state of a line at all on-chip caches at a directory or performing full-chip broadcasts are both non-scalable and extremely expensive solutions. In this paper, we make the case for Locality-Oblivious Cache Organization (LOCO), a CMP cache organization that leverages the on-chip network to create virtual single-cycle paths between distant caches, thus redefining the notion of locality. LOCO is a clustered cache organization, supporting both homogeneous and heterogeneous cluster sizes, and provides near single-cycle accesses to data anywhere within the cluster, just like a private cache. Globally, LOCO dynamically creates a virtual mesh connecting all the clusters, and performs an efficient global data search and migration over this virtual mesh, without having to resort to full-chip broadcasts or perform expensive directory lookups. 
Trace-driven and full system simulations running SPLASH-2 and PARSEC benchmarks show that LOCO improves application run time by up to 44.5% over baseline private and shared cache.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"196 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116442511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
Transactionalizing legacy code: an experience report using GCC and Memcached
Wenjia Ruan, Trilok Vyas, Yujie Liu, Michael F. Spear
{"title":"Transactionalizing legacy code: an experience report using GCC and Memcached","authors":"Wenjia Ruan, Trilok Vyas, Yujie Liu, Michael F. Spear","doi":"10.1145/2541940.2541960","DOIUrl":"https://doi.org/10.1145/2541940.2541960","url":null,"abstract":"The addition of transactional memory (TM) support to existing languages provides the opportunity to create new software from scratch using transactions, and also to simplify or extend legacy code by replacing existing synchronization with language-level transactions. In this paper, we describe our experiences transactionalizing the memcached application through the use of the GCC implementation of the Draft C++ TM Specification. We present experiences and recommendations that we hope will guide the effort to integrate TM into languages, and that may also contribute to the growing collective knowledge about how programmers can begin to exploit TM in existing production-quality software.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127174859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 163
Session details: Debate
D. Wood
{"title":"Session details: Debate","authors":"D. Wood","doi":"10.1145/3260933","DOIUrl":"https://doi.org/10.1145/3260933","url":null,"abstract":"","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121235961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
EnCore: exploiting system environment and correlation information for misconfiguration detection
Jiaqi Zhang, Lakshminarayanan Renganarayana, Xiaolan Zhang, Niyu Ge, Vasanth Bala, Tianyin Xu, Yuanyuan Zhou
{"title":"EnCore: exploiting system environment and correlation information for misconfiguration detection","authors":"Jiaqi Zhang, Lakshminarayanan Renganarayana, Xiaolan Zhang, Niyu Ge, Vasanth Bala, Tianyin Xu, Yuanyuan Zhou","doi":"10.1145/2541940.2541983","DOIUrl":"https://doi.org/10.1145/2541940.2541983","url":null,"abstract":"As software systems become more complex and configurable, failures due to misconfigurations are becoming a critical problem. Such failures often have serious functionality, security and financial consequences. Further, diagnosis and remediation for such failures require reasoning across the software stack and its operating environment, making it difficult and costly. We present a framework and tool called EnCore to automatically detect software misconfigurations. EnCore takes into account two important factors that are unexploited before: the interaction between the configuration settings and the executing environment, as well as the rich correlations between configuration entries. We embrace the emerging trend of viewing systems as data, and exploit this to extract information about the execution environment in which a configuration setting is used. EnCore learns configuration rules from a given set of sample configurations. With training data enriched with the execution context of configurations, EnCore is able to learn a broad set of configuration anomalies that spans the entire system. EnCore is effective in detecting both injected errors and known real-world problems - it finds 37 new misconfigurations in Amazon EC2 public images and 24 new configuration problems in a commercial private cloud. 
By systematically exploiting environment information and by learning correlation rules across multiple configuration settings, EnCore detects 1.6x to 3.5x more misconfiguration anomalies than previous approaches.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122758141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 115
KVM/ARM: the design and implementation of the linux ARM hypervisor
Chris Dall, Jason Nieh
{"title":"KVM/ARM: the design and implementation of the linux ARM hypervisor","authors":"Chris Dall, Jason Nieh","doi":"10.1145/2541940.2541946","DOIUrl":"https://doi.org/10.1145/2541940.2541946","url":null,"abstract":"As ARM CPUs become increasingly common in mobile devices and servers, there is a growing demand for providing the benefits of virtualization for ARM-based devices. We present our experiences building the Linux ARM hypervisor, KVM/ARM, the first full system ARM virtualization solution that can run unmodified guest operating systems on ARM multicore hardware. KVM/ARM introduces split-mode virtualization, allowing a hypervisor to split its execution across CPU modes and be integrated into the Linux kernel. This allows KVM/ARM to leverage existing Linux hardware support and functionality to simplify hypervisor development and maintainability while utilizing recent ARM hardware virtualization extensions to run virtual machines with comparable performance to native execution. KVM/ARM has been successfully merged into the mainline Linux kernel, ensuring that it will gain wide adoption as the virtualization platform of choice for ARM. We provide the first measurements on real hardware of a complete hypervisor using ARM hardware virtualization support. 
Our results demonstrate that KVM/ARM has modest virtualization performance and power costs, and can achieve lower performance and power costs compared to x86-based Linux virtualization on multicore hardware.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128261771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 207