Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems: Latest Publications

Session details: Session 2B: Performance Management
J. Larus
DOI: 10.1145/3252955
Published: 2018-03-19
Citations: 0

Filtering Translation Bandwidth with Virtual Caching
Hongil Yoon, Jason Lowe-Power, G. Sohi
DOI: 10.1145/3173162.3173195
Published: 2018-03-19
Abstract: Heterogeneous computing with GPUs integrated on the same chip as CPUs is ubiquitous, and to increase programmability many of these systems support virtual address accesses from GPU hardware. However, this entails address translation on every memory access. We observe that future GPUs and workloads show very high bandwidth demands (up to 4 accesses per cycle in some cases) for shared address translation hardware due to frequent private TLB misses. This greatly impacts performance (32% average performance degradation relative to an ideal MMU). To mitigate this overhead, we propose a software-agnostic, practical GPU virtual cache hierarchy. We use the virtual cache hierarchy as an effective address translation bandwidth filter. We observe that many requests that miss in private TLBs find valid data in the GPU cache hierarchy. With a GPU virtual cache hierarchy, these TLB misses can be filtered (i.e., turned into virtual cache hits), significantly reducing bandwidth demands on the shared address translation hardware. In addition, accelerator-specific attributes of GPUs (e.g., a lower likelihood of synonyms) reduce the design complexity of virtual caches, making a whole virtual cache hierarchy (including a shared L2 cache) practical for GPUs. Our evaluation shows that the entire GPU virtual cache hierarchy effectively filters the high address translation bandwidth, achieving almost the same performance as an ideal MMU. We also evaluate L1-only virtual cache designs and show that using a whole virtual cache hierarchy obtains additional performance benefits (1.31x speedup on average).
Citations: 23

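The filtering idea in the abstract above is easy to see in miniature: a request that misses the private TLB first probes a virtually-indexed, virtually-tagged cache, and only a virtual-cache miss consumes shared translation bandwidth. The C sketch below simulates that filter; the direct-mapped structure, sizes, and names are illustrative assumptions, not the paper's actual design.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define VC_LINES   256   /* illustrative virtual-cache capacity */
    #define LINE_SHIFT 6     /* 64-byte lines */

    /* One line of a virtually-indexed, virtually-tagged (VIVT) cache. */
    struct vc_line { bool valid; uint64_t vtag; };

    static struct vc_line vcache[VC_LINES];
    static unsigned long shared_translations;  /* traffic reaching the shared MMU */

    /* Placeholder for the expensive path: shared TLB lookup / page walk. */
    static uint64_t translate(uint64_t vaddr) {
        shared_translations++;           /* the bandwidth we want to filter */
        return vaddr ^ 0xABCD000000ULL;  /* fake physical address */
    }

    /* Memory access after a *private* TLB miss. */
    static void access_after_private_tlb_miss(uint64_t vaddr) {
        uint64_t vline = vaddr >> LINE_SHIFT;
        struct vc_line *l = &vcache[vline % VC_LINES];
        if (l->valid && l->vtag == vline)
            return;            /* virtual-cache hit: no translation at all */
        (void)translate(vaddr);/* miss: translate, fetch, fill the line    */
        l->valid = true;
        l->vtag  = vline;
    }

    int main(void) {
        /* Re-touching the same 256 lines: 256 misses, then pure hits,
         * so only 256 of the 1024 accesses consume translation bandwidth. */
        for (int pass = 0; pass < 4; pass++)
            for (uint64_t a = 0; a < 16 * 1024; a += 64)
                access_after_private_tlb_miss(a);
        printf("translations reaching shared hardware: %lu\n", shared_translations);
        return 0;
    }
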
The Architectural Implications of Autonomous Driving: Constraints and Acceleration
Shi-Chieh Lin, Yunqi Zhang, Chang-Hong Hsu, Matt Skach, Md E. Haque, Lingjia Tang, Jason Mars
DOI: 10.1145/3173162.3173191
Published: 2018-03-19
Abstract: Autonomous driving systems have attracted a significant amount of interest recently, and many industry leaders, such as Google, Uber, Tesla, and Mobileye, have invested large amounts of capital and engineering power in developing such systems. Building autonomous driving systems is particularly challenging due to stringent performance requirements, in terms of both making safe operational decisions and finishing processing in real time. Despite recent advancements in technology, such systems are still largely under experimentation, and architecting end-to-end autonomous driving systems remains an open research question. To investigate this question, we first present and formalize the design constraints for building an autonomous driving system in terms of performance, predictability, storage, thermal, and power. We then build an end-to-end autonomous driving system using state-of-the-art, award-winning algorithms to understand the design trade-offs for building such systems. In our real-system characterization, we identify three computational bottlenecks that conventional multicore CPUs are incapable of processing under the identified design constraints. To meet these constraints, we accelerate these algorithms on three accelerator platforms, GPUs, FPGAs, and ASICs, which can reduce the tail latency of the system by 169x, 10x, and 93x, respectively. With accelerator-based designs, we are able to build an end-to-end autonomous driving system that meets all the design constraints, and we explore the trade-offs among performance, power, and the higher accuracy enabled by higher-resolution cameras.
Citations: 315

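To make the latency reasoning above concrete, the sketch below checks whether an accelerated bottleneck stage fits within a tail-latency budget. Only the 169x/10x/93x speedups come from the abstract; the CPU baseline latency and the 100 ms budget are invented placeholders.

    #include <stdio.h>

    /* Illustrative constraint check: does an accelerated pipeline stage
     * fit inside an end-to-end tail-latency budget?  Speedups are quoted
     * from the abstract; the baseline and budget are assumptions. */

    struct platform { const char *name; double speedup; };

    int main(void) {
        const double budget_ms   = 100.0;   /* assumed end-to-end budget  */
        const double cpu_tail_ms = 1500.0;  /* assumed multicore-CPU tail */
        struct platform p[] = {
            { "GPU", 169.0 }, { "FPGA", 10.0 }, { "ASIC", 93.0 },
        };
        for (int i = 0; i < 3; i++) {
            double t = cpu_tail_ms / p[i].speedup;
            printf("%-4s: %7.2f ms -> %s budget\n", p[i].name, t,
                   t <= budget_ms ? "meets" : "misses");
        }
        return 0;
    }
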
Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks
Amirali Boroumand, Saugata Ghose, Youngsok Kim, Rachata Ausavarungnirun, Eric Shiu, Rahul Thakur, Daehyun Kim, Aki Kuusela, A. Knies, Parthasarathy Ranganathan, O. Mutlu
DOI: 10.1145/3173162.3173177
Published: 2018-03-19
Abstract: We are experiencing explosive growth in the number of consumer devices, including smartphones, tablets, web-based computers such as Chromebooks, and wearable devices. For this class of devices, energy efficiency is a first-class concern due to the limited battery capacity and thermal power budget. We find that data movement is a major contributor to the total system energy and execution time in consumer devices. The energy and performance costs of moving data between the memory system and the compute units are significantly higher than the costs of computation. As a result, addressing data movement is crucial for consumer devices. In this work, we comprehensively analyze the energy and performance impact of data movement for several widely-used Google consumer workloads: (1) the Chrome web browser; (2) TensorFlow Mobile, Google's machine learning framework; (3) video playback; and (4) video capture, both of which are used in many video services such as YouTube and Google Hangouts. We find that processing-in-memory (PIM) can significantly reduce data movement for all of these workloads by performing part of the computation close to memory. Each workload contains simple primitives and functions that contribute a significant fraction of the overall data movement. We investigate whether these primitives and functions are feasible to implement using PIM, given the limited area and power constraints of consumer devices. Our analysis shows that offloading these primitives to PIM logic, consisting of either simple cores or specialized accelerators, eliminates a large amount of data movement and significantly reduces total system energy (by an average of 55.4% across the workloads) and execution time (by an average of 54.2%).
Citations: 233

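A hedged sketch of the offload reasoning described above: a primitive is worth moving to PIM when its energy is dominated by data movement and the logic it needs fits the area/power budget available in memory. The primitive names and every number below are hypothetical placeholders, not measurements from the paper.

    #include <stdbool.h>
    #include <stdio.h>

    /* Toy offload-decision model: candidate if (a) movement-dominated and
     * (b) the offload logic fits in-memory area and power budgets. */

    struct primitive {
        const char *name;
        double movement_energy;  /* mJ spent moving data (assumed)      */
        double compute_energy;   /* mJ spent computing (assumed)        */
        double area_mm2;         /* logic area if offloaded (assumed)   */
        double power_mw;         /* logic power if offloaded (assumed)  */
    };

    static bool pim_candidate(const struct primitive *p,
                              double area_budget, double power_budget) {
        double total = p->movement_energy + p->compute_energy;
        return p->movement_energy / total > 0.5 &&   /* movement-dominated */
               p->area_mm2 <= area_budget &&
               p->power_mw <= power_budget;
    }

    int main(void) {
        struct primitive prims[] = {
            { "texture-tiling", 8.0, 1.0, 0.2, 30.0 },
            { "color-blitting", 6.0, 2.0, 0.3, 40.0 },
            { "video-encode",   3.0, 9.0, 4.0, 500.0 },
        };
        for (int i = 0; i < 3; i++)
            printf("%-15s -> %s\n", prims[i].name,
                   pim_candidate(&prims[i], 1.0, 100.0) ? "offload to PIM"
                                                        : "keep on CPU");
        return 0;
    }
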
VAULT: Reducing Paging Overheads in SGX with Efficient Integrity Verification Structures
Meysam Taassori, Ali Shafiee, R. Balasubramonian
DOI: 10.1145/3173162.3177155
Published: 2018-03-19
Abstract: Intel's SGX offers state-of-the-art security features, including confidentiality, integrity, and authentication (CIA), when accessing sensitive pages in memory. Sensitive pages are placed in an Enclave Page Cache (EPC) within the physical memory before they can be accessed by the processor. To control the overheads imposed by CIA guarantees, the EPC operates with a limited capacity (currently 128 MB). Because of this limited EPC size, sensitive pages must be frequently swapped between EPC and non-EPC regions in memory. A page swap is expensive (about 40K cycles) because it requires an OS system call, page copying, updates to integrity trees and metadata, etc. Our analysis shows that the paging overhead can slow the system by 5× on average, and other studies have reported even higher slowdowns for memory-intensive workloads. The paging overhead can be reduced by growing the size of the EPC to match the size of physical memory, while allowing the EPC to also accommodate non-sensitive pages. However, at least two important problems must be addressed to enable this growth in EPC: (i) the depth of the integrity tree and its cacheability must be improved to keep memory bandwidth overheads in check, and (ii) the space overheads of integrity verification (tree and MACs) must be reduced. We achieve both goals by introducing a variable arity unified tree (VAULT) organization that is more compact and has lower depth. We further reduce the space overheads with techniques that combine MAC sharing and compression. With simulations, we show that the combination of our techniques can address most inefficiencies in SGX memory access and improve overall performance by 3.7×, relative to an SGX baseline, while incurring a memory capacity overhead of only 4.7%.
Citations: 120

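The depth argument is simple arithmetic: an integrity tree of arity k covering N protected blocks needs ceil(log_k N) levels, so raising the arity (as VAULT's compact variable-arity nodes do) shallows the tree and cuts the extra memory accesses per miss. The sketch below tabulates this; the arity values are illustrative, and the exact node layouts of SGX and VAULT are not modeled. Compile with -lm.

    #include <math.h>
    #include <stdio.h>

    /* Depth of an integrity tree: each level fans out by the arity, so
     * covering N 64-byte blocks takes ceil(log_arity(N)) levels. */
    static int tree_depth(double num_blocks, double arity) {
        return (int)ceil(log(num_blocks) / log(arity));
    }

    int main(void) {
        double mem_bytes = 16.0 * 1024 * 1024 * 1024;  /* 16 GB protected (assumed) */
        double blocks    = mem_bytes / 64.0;           /* 64 B blocks               */
        int arities[] = { 8, 16, 32, 64 };             /* illustrative arities      */
        for (int i = 0; i < 4; i++)
            printf("arity %2d -> depth %d\n", arities[i],
                   tree_depth(blocks, (double)arities[i]));
        return 0;
    }
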
DLibOS: Performance and Protection with a Network-on-Chip
S. Mallon, V. Gramoli, Guillaume Jourjon
DOI: 10.1145/3173162.3173209
Published: 2018-03-19
Abstract: A long body of research work has led to the conjecture that highly efficient IO processing at user level would necessarily violate protection. In this paper, we debunk this myth by introducing DLibOS, a new paradigm that consists of distributing a library OS across specialized cores to achieve performance and protection at user level. Its main novelty is leveraging the network-on-chip to allow hardware message passing, rather than context switches, for communication between different address spaces. To demonstrate the feasibility of our approach, we implement a driver and a network stack at user level on a Tilera many-core machine. We define a novel asynchronous socket interface and partition the memory such that reception, transmission, and the application modify isolated regions. Our high-performance results of 4.2 and 3.1 million requests per second, obtained on a webserver and on Memcached, respectively, confirm the relevance of our design decisions. Finally, we compare DLibOS against a non-protected user-level network stack and show that protection comes at a negligible cost.
Citations: 6

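The paradigm above replaces context switches with hardware message passing between address spaces. A minimal software stand-in models the NoC channel as a single-producer/single-consumer ring shared by an application core and the network-stack core; all names and sizes are illustrative assumptions, not DLibOS's actual interface.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define RING_SLOTS 64

    struct msg { uint32_t sock_id; uint32_t len; /* payload lives elsewhere */ };

    /* The "channel": a single-producer/single-consumer ring standing in
     * for a hardware NoC message queue between two cores. */
    struct ring {
        _Atomic uint32_t head, tail;
        struct msg slot[RING_SLOTS];
    };

    /* App side: enqueue a send request; never traps into a kernel. */
    static int channel_send(struct ring *r, struct msg m) {
        uint32_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t h = atomic_load_explicit(&r->head, memory_order_acquire);
        if (t - h == RING_SLOTS) return -1;          /* channel full */
        r->slot[t % RING_SLOTS] = m;
        atomic_store_explicit(&r->tail, t + 1, memory_order_release);
        return 0;
    }

    /* Network-stack side: drain requests on its own dedicated core. */
    static int channel_recv(struct ring *r, struct msg *out) {
        uint32_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (h == t) return -1;                       /* channel empty */
        *out = r->slot[h % RING_SLOTS];
        atomic_store_explicit(&r->head, h + 1, memory_order_release);
        return 0;
    }

    int main(void) {
        static struct ring ch;
        channel_send(&ch, (struct msg){ .sock_id = 7, .len = 512 });
        struct msg m;
        if (channel_recv(&ch, &m) == 0)
            printf("stack core got: sock %u, %u bytes\n",
                   (unsigned)m.sock_id, (unsigned)m.len);
        return 0;
    }
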
Session details: Session 5B: Neural Networks
Adrian Sampson
DOI: 10.1145/3252961
Published: 2018-03-19
Citations: 0

DAMN: Overhead-Free IOMMU Protection for Networking
Alex Markuze, I. Smolyar, Adam Morrison, Dan Tsafrir
DOI: 10.1145/3173162.3173175
Published: 2018-03-19
Abstract: DMA operations can access memory buffers only if they are "mapped" in the IOMMU, so operating systems protect themselves against malicious/errant network DMAs by mapping and unmapping each packet immediately before/after it is DMAed. This approach was recently found to be riskier and less performant than keeping packets non-DMAable and instead copying their content to/from permanently-mapped buffers. Still, the extra copy hampers the performance of multi-gigabit networking. We observe that achieving protection at the DMA (un)map boundary is needlessly constraining, as devices must be prevented from changing the data only after the kernel reads it. So there is no real need to switch ownership of buffers between kernel and device at the DMA (un)mapping layer, as is done by all existing IOMMU protection schemes. We thus eliminate the extra copy by (1) implementing a new allocator called DMA-Aware Malloc for Networking (DAMN), which (de)allocates packet buffers from a memory pool permanently mapped in the IOMMU; (2) modifying the network stack to use this allocator; and (3) copying packet data only when the kernel needs it, which usually morphs the aforementioned extra copy into the kernel's standard copy operation performed at the user-kernel boundary. DAMN thus provides full IOMMU protection with performance comparable to that of an unprotected system.
Citations: 25

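A concept sketch of the allocator idea above: the pool is IOMMU-mapped once at startup, so the per-packet hot path does no mapping work at all. The iommu_map_once() stub and all names are hypothetical; this is not the paper's Linux implementation.

    #include <stddef.h>
    #include <stdio.h>

    #define BUF_SIZE  2048
    #define POOL_BUFS 1024

    struct buf { struct buf *next; char data[BUF_SIZE]; };

    static struct buf *free_list;

    static void iommu_map_once(void *base, size_t len) {
        /* Real code would program the IOMMU here, exactly once. */
        (void)base; (void)len;
    }

    static void pool_init(void) {
        static struct buf pool[POOL_BUFS];
        iommu_map_once(pool, sizeof pool);     /* the only (un)map ever */
        for (int i = 0; i < POOL_BUFS; i++) {  /* thread the free list  */
            pool[i].next = free_list;
            free_list = &pool[i];
        }
    }

    /* Hot path: no IOMMU work at all, just free-list pops and pushes. */
    static struct buf *damn_alloc(void) {
        struct buf *b = free_list;
        if (b) free_list = b->next;
        return b;
    }

    static void damn_free(struct buf *b) {
        b->next = free_list;
        free_list = b;
    }

    int main(void) {
        pool_init();
        struct buf *b = damn_alloc();
        printf("got permanently-mapped buffer at %p\n", (void *)b);
        damn_free(b);
        return 0;
    }
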
Sugar: Secure GPU Acceleration in Web Browsers
Zhihao Yao, Zongheng Ma, Yingtong Liu, A. A. Sani, Aparna Chandramowlishwaran
DOI: 10.1145/3173162.3173186
Published: 2018-03-19
Abstract: Modern personal computers have embraced increasingly powerful Graphics Processing Units (GPUs). Recently, GPU-based graphics acceleration in web apps (i.e., applications running inside a web browser) has become popular. WebGL is the main effort to provide OpenGL-like graphics for web apps, and it is currently used in 53% of the top-100 websites. Unfortunately, WebGL has posed serious security concerns, as several attack vectors have been demonstrated through WebGL. Web browsers' solutions to these attacks have been reactive: discovered vulnerabilities have been patched and new runtime security checks have been added. Unfortunately, this approach leaves the system vulnerable to zero-day exploits, especially given the large size of the Trusted Computing Base of the graphics plane. We present Sugar, a novel operating system solution that enhances the security of GPU acceleration for web apps by design. The key idea behind Sugar is using a dedicated virtual graphics plane for a web app by leveraging modern GPU virtualization solutions. A virtual graphics plane consists of a dedicated virtual GPU (or vGPU) as well as the full software graphics stack (including the device driver). Sugar enhances system security since a virtual graphics plane is fully isolated from the rest of the system. Despite GPU virtualization overhead, we show that Sugar achieves high performance. Moreover, unlike current systems, Sugar is able to use two underlying physical GPUs, when available, to co-render the User Interface (UI): one GPU is used to provide virtual graphics planes for web apps and the other to provide the primary graphics plane for the rest of the system. Such a design not only provides strong security guarantees but also enhanced performance isolation.
Citations: 15

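A toy illustration of the routing idea above, assuming a hypothetical table from web app to virtual graphics plane: web-app commands go to a per-app vGPU with its own driver stack, while the system UI keeps the primary plane. Nothing here reflects Sugar's real WebGL plumbing.

    #include <stdio.h>

    #define MAX_APPS 8

    /* A graphics plane: a vGPU plus its own copy of the driver stack. */
    struct gfx_plane { int vgpu_id; const char *driver_stack; };

    static struct gfx_plane primary = { 0, "host driver" };
    static struct gfx_plane app_plane[MAX_APPS];
    static int planes_used;

    /* Give a newly created web app a fully isolated plane of its own. */
    static struct gfx_plane *plane_for_app(int app_id) {
        if (app_plane[app_id].driver_stack == NULL) {
            app_plane[app_id].vgpu_id = ++planes_used;   /* fresh vGPU */
            app_plane[app_id].driver_stack = "per-app copy";
        }
        return &app_plane[app_id];
    }

    static void submit_gl(int app_id, const char *cmd) {
        /* Web-app commands never touch the primary plane's TCB. */
        struct gfx_plane *p = (app_id >= 0) ? plane_for_app(app_id) : &primary;
        printf("vGPU %d (%s) executes: %s\n", p->vgpu_id, p->driver_stack, cmd);
    }

    int main(void) {
        submit_gl(3, "drawArrays");    /* web app 3, isolated plane */
        submit_gl(-1, "compositeUI");  /* system UI, primary plane  */
        return 0;
    }
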
VIBNN: Hardware Acceleration of Bayesian Neural Networks
R. Cai, Ao Ren, Ning Liu, Caiwen Ding, Luhao Wang, Xuehai Qian, Massoud Pedram, Yanzhi Wang
DOI: 10.1145/3173162.3173212
Published: 2018-02-02
Abstract: Bayesian Neural Networks (BNNs) have been proposed to address the problem of model uncertainty in training and inference. By introducing weights associated with conditional probability distributions, BNNs are capable of resolving the overfitting issue commonly seen in conventional neural networks and allow for small-data training through the variational inference process. Frequent use of Gaussian random variables in this process requires a properly optimized Gaussian Random Number Generator (GRNG). The high hardware cost of conventional GRNGs makes the hardware implementation of BNNs challenging. In this paper, we propose VIBNN, an FPGA-based hardware accelerator design for variational inference on BNNs. We explore the design space for the massive number of Gaussian sampling tasks in BNNs. Specifically, we introduce two high-performance Gaussian (pseudo) random number generators: (1) the RAM-based Linear Feedback Gaussian Random Number Generator (RLF-GRNG), which is inspired by the properties of the binomial distribution and linear feedback logic; and (2) the Bayesian Neural Network-oriented Wallace Gaussian Random Number Generator. To achieve high scalability and efficient memory access, we propose a deeply pipelined accelerator architecture with fast execution and good hardware utilization. Experimental results demonstrate that the proposed VIBNN implementations on an FPGA can achieve a throughput of 321,543.4 Images/s and energy efficiency of up to 52,694.8 Images/J while maintaining accuracy similar to that of its software counterpart.
Citations: 71

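The RLF-GRNG rests on a textbook fact: a sum of n fair random bits is Binomial(n, 1/2), which approximates a Gaussian with mean n/2 and variance n/4. The sketch below reproduces that principle in software with a 16-bit LFSR as the bit source; it illustrates the statistics only, not the paper's RAM-based hardware structure. Compile with -lm.

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t lfsr = 0xACE1u;

    /* 16-bit Fibonacci LFSR (taps 16,14,13,11): a cheap hardware-style
     * pseudo-random bit source. */
    static int lfsr_bit(void) {
        int bit = ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5)) & 1;
        lfsr = (lfsr >> 1) | ((uint32_t)bit << 15);
        return (int)(lfsr & 1);
    }

    /* Sum of n bits ~ Binomial(n, 1/2) ~ N(n/2, n/4); standardize it to
     * get an approximately standard-normal sample. */
    static double gaussian_sample(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += lfsr_bit();
        return (sum - n / 2.0) / sqrt(n / 4.0);
    }

    int main(void) {
        double s = 0, s2 = 0;
        for (int i = 0; i < 10000; i++) {
            double x = gaussian_sample(64);
            s += x; s2 += x * x;
        }
        double mean = s / 10000, var = s2 / 10000 - mean * mean;
        printf("sample mean %.3f, variance %.3f (expect ~0, ~1)\n", mean, var);
        return 0;
    }
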