Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems: Latest Articles

Challenging Sequential Bitstream Processing via Principled Bitwise Speculation

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems Pub Date : 2020-03-09 DOI: 10.1145/3373376.3378461

Junqiao Qiu, Lin Jiang, Zhijia Zhao

Abstract: Many performance-critical applications traverse bitstreams with bitwise computations for better performance or higher space efficiency, such as multimedia processing and bitmap indexing. However, when these bitwise computations carry dependences, the entire bitstream traversal becomes serial, fundamentally limiting scalability. In this work, we show that bitstream-carried dependences are actually "breakable" in many cases, with the adoption of a systematic treatment: principled bitwise speculation (PBS). The core idea of PBS stems from an analogy drawn between bitstream programs and sequential circuits, both of which transform binary sequences. From this new perspective, it becomes natural to model the dependences in bitstream programs with finite-state machines (FSMs), a basic model for sequential circuits. To achieve this, PBS features an assembly of static analyses that reason about bitstream programs down to the bit level to identify the bits causing dependences; it then treats the value combinations of dependent bits as states to construct FSMs. This modeling, for the first time, enables the use of FSM speculation techniques to parallelize bitstream programs. By leveraging the state convergence of FSMs, the values of dependent bits can be predicted with much higher accuracy. In case the prediction fails, PBS tries to directly "rectify" the wrong outputs based on bitwise logic, minimizing the mis-speculation costs. In addition, the FSM shows even higher execution efficiency than the original program in some cases, making it an optimized version to accelerate serial bitstream processing. We prototyped PBS using LLVM. Evaluation with real-world bitstream programs confirms the effectiveness of PBS, showing up to near-linear speedup on multicore/manycore machines.

Citations: 7
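
To illustrate the FSM speculation idea described in the PBS abstract above, the following Python sketch splits a bitstream into chunks, guesses each chunk's start state by replaying a few preceding bits (relying on state convergence), and then validates the guesses in order. The toy three-state FSM, the function names, and the `lookback` parameter are all assumptions made for illustration and are not taken from the paper. On a wrong guess the sketch simply reprocesses the chunk, whereas the abstract says PBS tries to rectify the outputs with bitwise logic.

```python
def step(state, bit):
    """Toy FSM (assumed): count consecutive 1 bits, saturating at 2."""
    return min(state + 1, 2) if bit else 0


def run(state, bits):
    """Serial reference: emit 1 whenever at least two 1 bits in a row were seen."""
    out = []
    for b in bits:
        state = step(state, b)
        out.append(1 if state == 2 else 0)
    return state, out


def speculative_run(bits, n_chunks=4, lookback=4):
    """Chunked processing with speculated start states; returns (output, misses)."""
    size = max(1, -(-len(bits) // n_chunks))  # ceiling division
    chunks = [bits[i:i + size] for i in range(0, len(bits), size)]

    # Phase 1: iterations are independent, so they could run in parallel.
    results = []
    for idx, chunk in enumerate(chunks):
        if idx == 0:
            guess = 0  # the true initial state is known for the first chunk
        else:
            start = idx * size
            # Replay the last `lookback` bits from a default state; a 0 bit
            # resets this FSM, so the guess is usually the true state.
            guess, _ = run(0, bits[max(0, start - lookback):start])
        results.append((guess, run(guess, chunk)))

    # Phase 2: serial validation; reprocess a chunk if its guess was wrong.
    state, out, misses = 0, [], 0
    for chunk, (guess, (end, chunk_out)) in zip(chunks, results):
        if guess != state:
            misses += 1
            end, chunk_out = run(state, chunk)
        out.extend(chunk_out)
        state = end
    return out, misses
```

The output always matches the serial run because every guess is checked against the true end state of the previous chunk; a longer `lookback` only reduces how often a chunk has to be reprocessed.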

MERR

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems Pub Date : 2020-03-09 DOI: 10.1145/3373376.3378492

Yuanchao Xu, Yan Solihin, Xipeng Shen

Citations: 1

The Guardian Council: Parallel Programmable Hardware Security

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems Pub Date : 2020-03-09 DOI: 10.1145/3373376.3378463

S. Ainsworth, Timothy M. Jones

Abstract: Systems security is becoming more challenging in the face of untrusted programs and system users. The safeguards currently in use against attacks such as buffer overflows, violations of control-flow integrity, side channels and malware are limited. Software protection schemes, while flexible, are often too expensive, and hardware schemes, while fast, are too constrained or out-of-date to be practical. We demonstrate the best of both worlds with the Guardian Council, a novel parallel architecture to enforce a wide range of highly customisable and diverse security policies. We leverage heterogeneity and parallelism in the design of our system to perform security enforcement for a large high-performance core on a set of small microcontroller-sized cores. These Guardian Processing Elements (GPEs) are many orders of magnitude more efficient than conventional out-of-order superscalar processors, bringing high-performance security at very low power and area overheads. Alongside these highly parallel cores we provide fixed-function logging and communication units, and a powerful programming model, as part of an architecture designed for security. Evaluation on a range of existing hardware and software protection mechanisms, reimplemented on the Guardian Council, demonstrates the flexibility of our approach with negligible overheads, out-performing prior work in the literature. For instance, 4 GPEs can provide forward control-flow integrity with 0% overhead, while 6 GPEs can provide a full shadow stack at only 2%.

Citations: 4

Capuchin

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems Pub Date : 2020-03-09 DOI: 10.1145/3373376.3378505

Xuan Peng, Xuanhua Shi, Hulin Dai, Hai Jin, Weiliang Ma, Qian Xiong, Fan Yang, Xuehai Qian

Abstract: In recent years, deep learning has gained unprecedented success in various domains. The key to this success is larger and deeper deep neural networks (DNNs) that achieve very high accuracy. On the other hand, since GPU global memory is a scarce resource, large models also pose a significant challenge due to the memory requirement of the training process. This restriction limits the flexibility of DNN architecture exploration. In this paper, we propose Capuchin, a tensor-based GPU memory management module that reduces the memory footprint via tensor eviction/prefetching and recomputation. The key feature of Capuchin is that it makes memory management decisions based on dynamic tensor access patterns tracked at runtime. This design is motivated by the observation that the access pattern to tensors is regular during training iterations. Based on the identified patterns, one can exploit the total memory optimization space and offer fine-grained and flexible control of when and how to perform memory optimization techniques. We deploy Capuchin in a widely used deep learning framework, TensorFlow, and show that Capuchin can reduce the memory footprint by up to 85% among 6 state-of-the-art DNNs compared to the original TensorFlow. In particular, for the NLP model BERT, the maximum batch size that Capuchin supports exceeds that of TensorFlow and gradient checkpointing by 7x and 2.1x, respectively. We also show that Capuchin outperforms vDNN and gradient checkpointing by up to 286% and 55% under the same memory oversubscription.

Citations: 99

IIU

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems Pub Date : 2020-03-09 DOI: 10.1145/3373376.3378521

Jun Heo, Jaeyeon Won, Yejin Lee, Shivam Bharuka, Jaeyoung Jang, Tae Jun Ham, Jae W. Lee

Abstract: The inverted index serves as a fundamental data structure for efficient search across various applications such as full-text search engines, document analytics and other information retrieval systems. The storage requirement and query load for these structures have been growing at a rapid rate. Thus, an ideal indexing system should maintain a small index size with a low query processing time. Previous works have mainly focused on using CPUs and GPUs to exploit query parallelism while utilizing state-of-the-art compression schemes to fit the index in memory. However, scaling parallelism to maximally utilize memory bandwidth on these architectures is still challenging. In this work, we present IIU, a novel inverted index processing unit, to optimize query performance while maintaining a low memory overhead for index storage. To this end, we co-design the indexing scheme and hardware accelerator so that the accelerator can process a highly compressed inverted index at high throughput. In addition, IIU provides flexible interconnects between modules to take advantage of both intra- and inter-query parallelism. Our evaluation using a cycle-level simulator demonstrates that IIU provides an average of 13.8× query latency reduction and 5.4× throughput improvement across different query types, while reducing the average energy consumption by 18.6×, compared to Apache Lucene, a production-grade full-text search framework.

Citations: 5
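
As background for the IIU abstract above, here is a minimal Python sketch of a compressed inverted index and a conjunctive (AND) query over it. Gap (delta) encoding is used as a generic stand-in for compression; it is an assumption for illustration and not the indexing scheme co-designed in the paper. The function names and the sample index are likewise hypothetical.

```python
from functools import reduce


def delta_encode(doc_ids):
    """Store a sorted posting list as gaps between consecutive document IDs."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def delta_decode(gaps):
    """Recover document IDs by taking the running sum of the gaps."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


def intersect(a, b):
    """Merge-style intersection of two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def query_and(index, terms):
    """Return IDs of documents containing every term (empty if a term is unknown)."""
    if not terms or any(t not in index for t in terms):
        return []
    return reduce(intersect, (delta_decode(index[t]) for t in terms))


# Hypothetical two-term index: term -> gap-encoded posting list.
index = {
    "memory": delta_encode([2, 5, 9, 14]),
    "page": delta_encode([5, 7, 14, 20]),
}
```

Decoding and intersecting different posting lists are independent pieces of work, which is the kind of intra- and inter-query parallelism the abstract refers to.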

DeepSniffer: A DNN Model Extraction Framework Based on Learning Architectural Hints

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems Pub Date : 2020-03-09 DOI: 10.1145/3373376.3378460

Xing Hu, Ling Liang, Shuangchen Li, Lei Deng, Pengfei Zuo, Yu Ji, Xinfeng Xie, Yufei Ding, Chang Liu, T. Sherwood, Yuan Xie

Abstract: As deep neural networks (DNNs) continue their reach into a wide range of application domains, the neural network architecture of DNN models becomes an increasingly sensitive subject, due to either intellectual property protection or risks of adversarial attacks. Previous studies explore leveraging architecture-level events exposed in hardware platforms to extract the model architecture information. They have the following limitations: requiring a priori knowledge of victim models, lacking robustness and generality, or obtaining incomplete information about the victim model architecture. Our paper proposes DeepSniffer, a learning-based model extraction framework to obtain the complete model architecture information without any prior knowledge of the victim model. It is robust to architectural and system noise introduced by the complex memory hierarchy and diverse run-time system optimizations. The basic idea of DeepSniffer is to learn the relation between extracted architectural hints (e.g., volumes of memory reads/writes obtained by side-channel or bus snooping attacks) and model internal architectures. Taking GPU platforms as a showcase, DeepSniffer conducts model extraction by learning both the architecture-level execution features of kernels and the inter-layer temporal association information introduced by the common practice of DNN design. We demonstrate experimentally that DeepSniffer works on an off-the-shelf Nvidia GPU platform running a variety of DNN models. The extracted models are directly helpful for crafting adversarial inputs. Our experimental results show that DeepSniffer achieves a high accuracy of model extraction and thus improves the adversarial attack success rate from 14.6%-25.5% (without network architecture knowledge) to 75.9% (with extracted network architecture). The DeepSniffer project has been released on GitHub.

Citations: 79

HMC

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems Pub Date : 2020-03-09 DOI: 10.1108/eb016039

Michalis Kokologiannakis, Viktor Vafeiadis

Citations: 1

Cross-Failure Bug Detection in Persistent Memory Programs

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems Pub Date : 2020-03-09 DOI: 10.1145/3373376.3378452

Sihang Liu, Korakit Seemakhupt, Yizhou Wei, T. Wenisch, Aasheesh Kolli, S. Khan

Abstract: Persistent memory (PM) technologies, such as Intel's Optane memory, deliver high performance, byte-addressability, and persistence, allowing programs to directly manipulate persistent data in memory without any OS intermediaries. An important requirement of these programs is that persistent data must remain consistent across a failure, which we refer to as the crash consistency guarantee. However, maintaining crash consistency is not trivial. We identify that a consistent recovery critically depends not only on the execution before the failure, but also on the recovery and resumption after failure. We refer to these stages as the pre- and post-failure execution stages. In order to holistically detect crash consistency bugs, we categorize the underlying causes behind inconsistent recovery due to incorrect interactions between the pre- and post-failure execution. First, a program is not crash-consistent if the post-failure stage reads from locations that are not guaranteed to be persisted in all possible access interleavings during the pre-failure stage. This type of programming error leads to a race that we refer to as a cross-failure race. Second, a program is not crash-consistent if the post-failure stage reads persistent data that has been left semantically inconsistent during the pre-failure stage, such as a stale log or uncommitted data. We refer to this type of bug as a cross-failure semantic bug. Together, they form the cross-failure bugs in PM programs. In this work, we provide XFDetector, a tool that detects cross-failure bugs by automatically injecting failures into the pre-failure execution, and checking for cross-failure races and semantic bugs in the post-failure continuation. XFDetector has detected four new bugs in three pieces of PM software: one of PMDK's examples, a PM-optimized Redis database, and a PMDK library function.

Citations: 48

Vortex: Extreme-Performance Memory Abstractions for Data-Intensive Streaming Applications

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems Pub Date : 2020-03-09 DOI: 10.1145/3373376.3378527

Carson Hanel, Arif Arman, Di Xiao, J. Keech, D. Loguinov

Abstract: Many applications in data analytics, information retrieval, and cluster computing process huge amounts of information. The complexity of the algorithms involved and the massive scale of the data require a programming model that can not only offer a simple abstraction for inputs larger than RAM, but also squeeze maximum performance out of the available hardware. While these are usually conflicting goals, we show that this does not have to be the case for sequentially processed data, i.e., in streaming applications. We develop a set of algorithms called Vortex that force the application to generate access violations (i.e., page faults) during processing of the stream, which are transparently handled in a way that creates the illusion of an infinite buffer that fits into a regular C/C++ pointer. This design makes Vortex by far the simplest-to-use and fastest platform for various types of streaming I/O, inter-thread data transfer, and key shuffling. We introduce several such applications: a file I/O wrapper, a bounded producer-consumer pipeline, a vanishing array, a key-partitioning engine, and a novel in-place radix sort that is 3-4 times faster than the best prior approaches.

Citations: 3

Elastic Cuckoo Page Tables: Rethinking Virtual Memory Translation for Parallelism

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems Pub Date : 2020-03-09 DOI: 10.1145/3373376.3378493

Dimitrios Skarlatos, Apostolos Kokolis, Tianyin Xu, J. Torrellas

Abstract: The unprecedented growth in the memory needs of emerging memory-intensive workloads has made virtual memory translation a major performance bottleneck. To address this problem, this paper introduces Elastic Cuckoo Page Tables, a novel page table design that transforms the sequential pointer-chasing operation used by conventional multi-level radix page tables into fully-parallel look-ups. The resulting design harvests, for the first time, the benefits of memory level parallelism for address translation. Elastic cuckoo page tables use Elastic Cuckoo Hashing, a novel extension of cuckoo hashing that supports efficient page table resizing. Elastic cuckoo page tables efficiently resolve hash collisions, provide process-private page tables, support multiple page sizes and page sharing among processes, and dynamically adapt page table sizes to meet application requirements. We evaluate elastic cuckoo page tables with full-system simulations of an 8-core processor using a set of graph analytics, bioinformatics, HPC, and system workloads. Elastic cuckoo page tables reduce the address translation overhead by an average of 41% over conventional radix page tables. The result is a 3-18% speed-up in application execution.

Citations: 40
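
The Elastic Cuckoo Page Tables abstract above builds on cuckoo hashing, so a small Python sketch of the classic scheme may help. It maps virtual page numbers to physical page numbers in a d-way cuckoo table whose lookup probes are mutually independent. This is plain cuckoo hashing with whole-table doubling on insertion failure, not the elastic resizing proposed in the paper; the class name, the salts, and all parameters are assumptions for illustration.

```python
class CuckooPageTable:
    """Toy d-way cuckoo hash table mapping virtual to physical page numbers."""

    def __init__(self, ways=2, slots=8, max_kicks=32):
        self.ways = ways
        self.slots = slots
        self.max_kicks = max_kicks
        self.tables = [[None] * slots for _ in range(ways)]
        # One arbitrary salt per way, so each way hashes keys differently.
        self.salts = [0x9E3779B1 * (w + 1) for w in range(ways)]

    def _index(self, way, vpn):
        return hash((self.salts[way], vpn)) % self.slots

    def lookup(self, vpn):
        # Each way is probed at a location computed only from the key, so the
        # probes do not depend on each other and could be issued in parallel.
        for way in range(self.ways):
            entry = self.tables[way][self._index(way, vpn)]
            if entry is not None and entry[0] == vpn:
                return entry[1]
        return None

    def _place(self, item):
        """Insert by displacement; return None, or the entry left without a slot."""
        way = 0
        for _ in range(self.max_kicks):
            i = self._index(way, item[0])
            item, self.tables[way][i] = self.tables[way][i], item
            if item is None:
                return None
            way = (way + 1) % self.ways
        return item

    def _grow(self, pending):
        """Double the table and rehash everything (not the paper's elastic resizing)."""
        entries = [e for t in self.tables for e in t if e is not None]
        entries.append(pending)
        while True:
            self.slots *= 2
            self.tables = [[None] * self.slots for _ in range(self.ways)]
            if all(self._place(e) is None for e in entries):
                return

    def insert(self, vpn, ppn):
        # Overwrite in place if the page is already mapped.
        for way in range(self.ways):
            i = self._index(way, vpn)
            entry = self.tables[way][i]
            if entry is not None and entry[0] == vpn:
                self.tables[way][i] = (vpn, ppn)
                return
        leftover = self._place((vpn, ppn))
        if leftover is not None:
            self._grow(leftover)
```

A lookup touches at most `ways` slots regardless of table size, in contrast to the level-by-level pointer chasing of a radix page table that the abstract describes.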