Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems: Latest Publications

Session details: Session 1A: Persistent Memory
Angela Demke Brown
DOI: 10.1145/3251026 (published 2015-03-14)
Citations: 0
ApproxHadoop: Bringing Approximations to MapReduce Frameworks
Íñigo Goiri, R. Bianchini, Santosh Nagarakatte, Thu D. Nguyen
DOI: 10.1145/2694344.2694351 (published 2015-03-14)
Abstract: We propose and evaluate a framework for creating and running approximation-enabled MapReduce programs. Specifically, we propose approximation mechanisms that fit naturally into the MapReduce paradigm, including input data sampling, task dropping, and accepting and running both a precise and a user-defined approximate version of the MapReduce code. We then show how to leverage statistical theory to compute error bounds for popular classes of MapReduce programs when approximating with input data sampling and/or task dropping. We implement the proposed mechanisms and error bound estimations in a prototype system called ApproxHadoop. Our evaluation uses MapReduce applications from different domains, including data analytics, scientific computing, video encoding, and machine learning. Our results show that ApproxHadoop can significantly reduce application execution time and/or energy consumption when the user is willing to tolerate small errors. For example, ApproxHadoop can reduce runtimes by up to 32x when the user can tolerate an error of 1% with 95% confidence. We conclude that our framework and system can make approximation easily accessible to many application domains using the MapReduce model.
Citations: 195
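For intuition on the error bounds mentioned in the abstract, the C sketch below is our own minimal illustration, not ApproxHadoop code: it extrapolates a total from a simple random sample of records and attaches a 95% confidence interval using the normal approximation (z = 1.96) and a finite-population correction. The sample values and population size N are made up for the example; the paper's actual bounds for sampled and dropped tasks use more involved sampling theory.

```c
/* Minimal sketch (not ApproxHadoop code): estimate the total of a dataset of
 * N records from a simple random sample of n records, with a 95% confidence
 * interval from the normal approximation. Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double sample[] = {12.0, 7.5, 9.3, 14.1, 8.8, 11.2, 10.4, 9.9}; /* hypothetical values */
    int n = (int)(sizeof(sample) / sizeof(sample[0]));
    long N = 1000000;                 /* hypothetical population size */

    double sum = 0.0, sumsq = 0.0;
    for (int i = 0; i < n; i++) { sum += sample[i]; sumsq += sample[i] * sample[i]; }
    double mean = sum / n;
    double var  = (sumsq - n * mean * mean) / (n - 1);   /* sample variance */

    double est_total = (double)N * mean;
    /* Standard error of the estimated total, with finite-population correction. */
    double se = (double)N * sqrt(var / n) * sqrt(1.0 - (double)n / N);
    double half_width = 1.96 * se;                        /* 95% confidence */

    printf("estimated total = %.1f +/- %.1f (95%% confidence)\n", est_total, half_width);
    return 0;
}
```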
Architectural Support for Dynamic Linking
Varun Agrawal, Abhiroop Dabral, Tapti Palit, Yongming Shen, M. Ferdman
DOI: 10.1145/2694344.2694392 (published 2015-03-14)
Abstract: All software in use today relies on libraries, including standard libraries (e.g., C, C++) and application-specific libraries (e.g., libxml, libpng). Most libraries are loaded into memory and dynamically linked when programs are launched, resolving symbol addresses across the applications and libraries. Dynamic linking has many benefits: it allows code to be reused between applications, conserves memory (because only one copy of a library is kept in memory for all the applications that share it), and allows libraries to be patched and updated without modifying programs, among numerous other benefits. However, these benefits come at the cost of performance. For every call made to a function in a dynamically linked library, a trampoline is used to read the function address from a lookup table and branch to the function, incurring memory load and branch operations. Static linking avoids this performance penalty, but loses all the benefits of dynamic linking. Given its myriad benefits, dynamic linking is the predominant choice today, despite the performance cost. In this work, we propose a speculative hardware mechanism that optimizes dynamic linking by avoiding execution of the trampolines for library function calls, providing the benefits of dynamic linking with the performance of static linking. Speculatively skipping the memory load and branch operations of the library call trampolines improves performance by reducing the number of executed instructions, and gains additional performance by reducing pressure on the instruction and data caches, TLBs, and branch predictors. Because the indirect targets of library call trampolines do not change during program execution, our speculative mechanism never misspeculates in practice. We evaluate our technique on real hardware with production software and observe up to 4% speedup using only 1.5KB of on-chip storage.
Citations: 12
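The trampoline overhead targeted above can be pictured with plain C function pointers. The sketch below is our own simplification, not the paper's mechanism or real PLT/GOT machinery: a dynamically linked call amounts to an extra memory load (fetching the target from a table analogous to the GOT) plus an indirect branch, which is exactly what the proposed hardware speculates past.

```c
/* Simplified sketch of the indirection behind a dynamic-library call: the call
 * site reaches a stub that loads the target from a lookup table (analogous to
 * the GOT) and branches to it indirectly. Real PLT/GOT code is linker-emitted
 * assembly; this only illustrates the extra load + indirect branch. */
#include <stdio.h>

static int lib_function(int x) { return x + 1; }   /* stands in for a library symbol */

/* "GOT" entry: in reality filled in by the dynamic linker at load time. */
static int (*got_entry)(int) = lib_function;

/* Trampoline: load the address from the table, then branch indirectly. */
static int call_via_plt(int x) { return got_entry(x); }

int main(void) {
    printf("direct:  %d\n", lib_function(41));   /* what static linking compiles to */
    printf("via PLT: %d\n", call_via_plt(41));   /* extra load + indirect branch */
    return 0;
}
```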
Asymmetric Memory Fences: Optimizing Both Performance and Implementability
Yuelu Duan, N. Honarmand, J. Torrellas
DOI: 10.1145/2694344.2694388 (published 2015-03-14)
Abstract: There have been several recent efforts to improve the performance of fences. The most aggressive designs allow post-fence accesses to retire and complete before the fence completes. Unfortunately, such designs present implementation difficulties due to their reliance on global state and structures. This paper's goal is to optimize both the performance and the implementability of fences. We start off with a design like the most aggressive ones but without the global state; we call it Weak Fence, or wF. Since the concurrent execution of multiple wFs can deadlock, we combine wFs with a conventional fence (i.e., Strong Fence, or sF) for the less performance-critical thread(s). We call the result an Asymmetric fence group. We also propose a taxonomy of Asymmetric fence groups under TSO. Compared to past aggressive fences, Asymmetric fence groups are both substantially easier to implement and higher-performing on average. The two main designs presented (WS+ and W+) speed up workloads under TSO by an average of 13% and 21%, respectively, over conventional fences.
Citations: 7
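As background on why fences sit on the critical path: under TSO, a store followed by a load of a different location can be reordered, so Dekker-style mutual exclusion needs a full fence between the two. The C11 sketch below is ours, for illustration only; the marked fences are the kind of operation that designs such as wF and sF try to make cheap.

```c
/* Dekker-style flag protocol: each thread sets its own flag, then checks the
 * other's. Under TSO the store->load pair can reorder unless a full fence sits
 * between them. Illustrative sketch, not code from the paper. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int flag0, flag1;
int in_critical0, in_critical1;

void *thread0(void *arg) {
    (void)arg;
    atomic_store_explicit(&flag0, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);            /* the expensive fence */
    if (atomic_load_explicit(&flag1, memory_order_relaxed) == 0)
        in_critical0 = 1;                                  /* may enter critical section */
    return NULL;
}

void *thread1(void *arg) {
    (void)arg;
    atomic_store_explicit(&flag1, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&flag0, memory_order_relaxed) == 0)
        in_critical1 = 1;
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, thread0, NULL);
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* With the fences, at most one thread can observe the other's flag as 0. */
    printf("t0 in critical: %d, t1 in critical: %d\n", in_critical0, in_critical1);
    return 0;
}
```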
rIOMMU: Efficient IOMMU for I/O Devices that Employ Ring Buffers
M. Malka, Nadav Amit, Muli Ben-Yehuda, Dan Tsafrir
DOI: 10.1145/2694344.2694355 (published 2015-03-14)
Abstract: The IOMMU allows the OS to encapsulate I/O devices in their own virtual memory spaces, thus restricting their DMAs to specific memory pages. The OS uses the IOMMU to protect itself against buggy drivers and malicious/errant devices. But the added protection comes at a cost, degrading the throughput of I/O-intensive workloads by up to an order of magnitude. This cost has motivated system designers to trade off some safety for performance, e.g., by leaving stale information in the IOTLB for a while so as to amortize costly invalidations. We observe that high-bandwidth devices, such as network and PCIe SSD controllers, interact with the OS via circular ring buffers that induce a sequential, predictable workload. We design a ring IOMMU (rIOMMU) that leverages this characteristic by replacing the virtual memory page table hierarchy with a circular, flat table. A flat table is adequately supported by exactly one IOTLB entry, making every new translation an implicit invalidation of the former and thus requiring explicit invalidations only at the end of I/O bursts. Using standard networking benchmarks, we show that rIOMMU provides up to 7.56x higher throughput relative to the baseline IOMMU, and that it is within 0.77-1.00x the throughput of a system without IOMMU protection.
Citations: 44
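The data-structure change is easiest to see in miniature. The C model below is a loose simplification of ours, not the paper's hardware design: translations for a DMA ring live in a flat circular table, and a single cached entry is simply overwritten whenever a new slot is translated, so installing one translation implicitly evicts the previous one; real hardware also handles permissions, faults, and explicit invalidation at the end of an I/O burst.

```c
/* Toy model of the rIOMMU idea (illustrative only): a flat circular
 * translation table plus a one-entry "IOTLB" that is overwritten on every new
 * translation, making each install an implicit invalidation of the last. */
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 8

typedef struct { uint64_t iova; uint64_t phys; int valid; } rentry_t;

static rentry_t ring_table[RING_SIZE];   /* flat circular translation table */
static rentry_t iotlb;                   /* single cached translation */

static void map(unsigned slot, uint64_t iova, uint64_t phys) {
    ring_table[slot % RING_SIZE] = (rentry_t){ iova, phys, 1 };
}

static uint64_t translate(unsigned slot, uint64_t iova) {
    if (iotlb.valid && iotlb.iova == iova)        /* single-entry IOTLB hit */
        return iotlb.phys;
    rentry_t e = ring_table[slot % RING_SIZE];    /* one flat-table lookup, no hierarchy */
    iotlb = e;         /* installing the new entry implicitly evicts the old one */
    return (e.valid && e.iova == iova) ? e.phys : 0;
}

int main(void) {
    for (unsigned i = 0; i < 4; i++)
        map(i, 0x1000ull * (i + 1), 0x80000000ull + 0x1000ull * i);
    for (unsigned i = 0; i < 4; i++)
        printf("slot %u: iova %#llx -> phys %#llx\n", i,
               (unsigned long long)(0x1000ull * (i + 1)),
               (unsigned long long)translate(i, 0x1000ull * (i + 1)));
    return 0;
}
```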
Watson and the Era of Cognitive Computing
G. Banavar
DOI: 10.1145/2786763.2694376 (published 2015-03-14)
Abstract: In the last decade, the availability of massive amounts of new data and the development of new machine learning technologies have augmented reasoning systems to give rise to a new class of computing systems. These "cognitive systems" learn from data, reason from models, and interact naturally with us to perform complex tasks better than either humans or machines can do by themselves. In essence, cognitive systems help us perform like the best by penetrating the complexity of big data and leveraging the power of models. One of the first cognitive systems, called Watson, demonstrated through a Jeopardy! exhibition match that it was capable of answering complex factoid questions as effectively as the world's champions. Follow-on cognitive systems perform other tasks, such as discovery, reasoning, and multi-modal understanding, in a variety of domains, including healthcare, insurance, and education. We believe such cognitive systems will transform every industry and our everyday life for the better. In this talk, I give an overview of the applications, the underlying capabilities, and some of the key challenges of cognitive systems.
Citations: 12
DIABLO: A Warehouse-Scale Computer Network Simulator using FPGAs
Zhangxi Tan, Zhenghao Qian, X. Chen, K. Asanović, D. Patterson
DOI: 10.1145/2694344.2694362 (published 2015-03-14)
Abstract: Motivated by rapid software and hardware innovation in warehouse-scale computing (WSC), we revisit the problem of warehouse-scale network design evaluation. A WSC is composed of about 30 arrays or clusters, each of which contains about 3,000 servers, leading to a total of about 100,000 servers per WSC. We found that many prior experiments have been conducted on relatively small physical testbeds, and that they often assume the workload is static and that computations are only loosely coupled with the adaptive networking stack. We present a novel and cost-efficient FPGA-based evaluation methodology, called Datacenter-In-A-Box at LOw cost (DIABLO), which treats arrays as whole computers with tightly integrated hardware and software. We have built a 3,000-node prototype running the full WSC software stack. Using our prototype, we have successfully reproduced several WSC phenomena, such as TCP Incast and the memcached request-latency long tail, and found that results do indeed change both with scale and with the version of the full software stack.
Citations: 21
Beyond the PDP-11: Architectural Support for a Memory-Safe C Abstract Machine
D. Chisnall, Colin Rothwell, R. Watson, Jonathan Woodruff, Munraj Vadera, S. Moore, M. Roe, Brooks Davis, P. Neumann
DOI: 10.1145/2694344.2694367 (published 2015-03-14)
Abstract: We propose a new memory-safe interpretation of the C abstract machine that provides stronger protection to benefit security and debugging. Despite ambiguities in the specification intended to provide implementation flexibility, contemporary implementations of C have converged on a memory model similar to that of the PDP-11, the original target for C. This model lacks support for memory safety despite well-documented impacts on security and reliability. Attempts to change this model are often hampered by assumptions embedded in a large body of existing C code, dating back to the memory model exposed by the original C compiler for the PDP-11. Our experience with attempting to implement a memory-safe variant of C on the CHERI experimental microprocessor led us to identify a number of problematic idioms. We describe these, as well as their interaction with existing memory safety schemes and the assumptions they make beyond the requirements of the C specification. Finally, we refine the CHERI ISA and abstract model for C by combining elements of the CHERI capability model and fat pointers, and we present a softcore CPU that implements a C abstract machine able to run legacy C code with strong memory protection guarantees.
Citations: 89
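For readers who have not seen the kind of "problematic idioms" the abstract alludes to, the fragment below is our own illustration, not an example taken from the paper. It shows legal-looking C patterns that assume a flat, PDP-11-style memory model and that any capability- or fat-pointer-based memory-safe C must either support or trap: round-tripping a pointer through an integer, one-past-the-end pointer arithmetic, and byte-wise copies of pointers.

```c
/* Illustration (not from the paper) of C idioms that assume a flat,
 * PDP-11-style memory model and complicate memory-safe implementations. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    int buf[4] = {1, 2, 3, 4};

    /* Idiom 1: round-tripping a pointer through an integer. A fat-pointer or
     * capability system must decide how (or whether) bounds survive the cast. */
    uintptr_t bits = (uintptr_t)&buf[0];
    int *p = (int *)bits;
    printf("via uintptr_t: %d\n", *p);

    /* Idiom 2: a one-past-the-end pointer as a loop bound. Legal in ISO C, but
     * the pointer must carry enough bounds information to allow its creation
     * while still trapping an actual out-of-bounds access. */
    for (int *q = buf; q != buf + 4; q++)
        printf("%d ", *q);
    printf("\n");

    /* Idiom 3: byte-wise copying of pointer values with memcpy, which a
     * tagged-capability memory must preserve without stripping protection. */
    int *copy;
    memcpy(&copy, &p, sizeof copy);
    printf("after memcpy: %d\n", *copy);
    return 0;
}
```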
DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations
Hyojin Sung, S. Adve
DOI: 10.1145/2694344.2694356 (published 2015-03-14)
Abstract: Current shared-memory hardware is complex and inefficient. Prior work on the DeNovo coherence protocol showed that disciplined shared-memory programming models can enable hardware that is more complexity-, performance-, and energy-efficient than the state-of-the-art MESI protocol. DeNovo, however, severely restricted the synchronization constructs an application can use. This paper proposes DeNovoSync, a technique to support arbitrary synchronization in DeNovo. The key challenge is that DeNovo exploits race-freedom to use reader-initiated local self-invalidations (instead of conventional writer-initiated remote cache invalidations) to ensure coherence. Synchronization accesses are inherently racy and not directly amenable to self-invalidations. DeNovoSync addresses this challenge using a novel combination of registration of all synchronization reads with a judicious hardware backoff to limit unnecessary registrations. For a wide variety of synchronization constructs and applications, compared to MESI, DeNovoSync shows comparable or up to 22% lower execution time and up to 58% lower network traffic, enabling DeNovo's advantages for a much broader class of software than previously possible.
Citations: 31
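The "inherently racy" synchronization accesses the abstract refers to are patterns like the test-and-set loop of a spinlock, where many threads concurrently read and modify the same word with no data-race-free ordering between them. The C11 sketch below is ours, included only to make that access pattern concrete; it is not code or a mechanism from the paper.

```c
/* A standard test-and-set spinlock in C11 atomics. The lock word is read and
 * modified concurrently by all threads, which is the racy access pattern that
 * reader-initiated self-invalidation alone cannot handle. Illustrative only. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            ;                               /* spin: concurrent RMWs on the lock word */
        counter++;                          /* critical section */
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, 4 * 100000);
    return 0;
}
```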
PuDianNao: A Polyvalent Machine Learning Accelerator
Dao-Fu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, O. Temam, Xiaobing Feng, Xuehai Zhou, Yunji Chen
{"title":"PuDianNao: A Polyvalent Machine Learning Accelerator","authors":"Dao-Fu Liu, Tianshi Chen, Shaoli Liu, Jinhong Zhou, Shengyuan Zhou, O. Temam, Xiaobing Feng, Xuehai Zhou, Yunji Chen","doi":"10.1145/2694344.2694358","DOIUrl":"https://doi.org/10.1145/2694344.2694358","url":null,"abstract":"Machine Learning (ML) techniques are pervasive tools in various emerging commercial applications, but have to be accommodated by powerful computer systems to process very large data. Although general-purpose CPUs and GPUs have provided straightforward solutions, their energy-efficiencies are limited due to their excessive supports for flexibility. Hardware accelerators may achieve better energy-efficiencies, but each accelerator often accommodates only a single ML technique (family). According to the famous No-Free-Lunch theorem in the ML domain, however, an ML technique performs well on a dataset may perform poorly on another dataset, which implies that such accelerator may sometimes lead to poor learning accuracy. Even if regardless of the learning accuracy, such accelerator can still become inapplicable simply because the concrete ML task is altered, or the user chooses another ML technique. In this study, we present an ML accelerator called PuDianNao, which accommodates seven representative ML techniques, including k-means, k-nearest neighbors, naive bayes, support vector machine, linear regression, classification tree, and deep neural network. Benefited from our thorough analysis on computational primitives and locality properties of different ML techniques, PuDianNao can perform up to 1056 GOP/s (e.g., additions and multiplications) in an area of 3.51 mm^2, and consumes 596 mW only. Compared with the NVIDIA K20M GPU (28nm process), PuDianNao (65nm process) is 1.20x faster, and can reduce the energy by 128.41x.","PeriodicalId":403247,"journal":{"name":"Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123206708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 287
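As a quick back-of-envelope check, using only the figures quoted in the abstract (the arithmetic is ours, not the authors'), 1056 GOP/s at 596 mW is roughly 1770 GOP/s per watt, and a 1.20x speedup combined with 128.41x lower energy implies roughly a 154x energy-delay-product advantage over the K20M.

```c
/* Back-of-envelope check of the numbers quoted in the abstract (ours, not the
 * authors'): throughput per watt and the implied energy-delay-product gain. */
#include <stdio.h>

int main(void) {
    double gops = 1056.0;        /* peak throughput, GOP/s */
    double watts = 0.596;        /* reported power: 596 mW */
    double speedup = 1.20;       /* vs. NVIDIA K20M */
    double energy_gain = 128.41; /* energy reduction vs. K20M */

    printf("efficiency: %.0f GOP/s per watt\n", gops / watts);           /* ~1772 */
    printf("implied EDP gain vs. K20M: %.0fx\n", speedup * energy_gain); /* ~154x */
    return 0;
}
```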