DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning
Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, O. Temam. In ASPLOS '14 (Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems), 2014. DOI: 10.1145/2541940.2541967

Abstract: Machine-learning tasks are becoming pervasive in a broad range of domains and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially convolutional and deep neural networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve toward heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance, and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neuron output additions) in a small footprint of 3.02 mm² at 485 mW; compared to a 128-bit 2 GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.
{"title":"Prototyping symbolic execution engines for interpreted languages","authors":"Stefan Bucur, Johannes Kinder, George Candea","doi":"10.1145/2541940.2541977","DOIUrl":"https://doi.org/10.1145/2541940.2541977","url":null,"abstract":"Symbolic execution is being successfully used to automatically test statically compiled code. However, increasingly more systems and applications are written in dynamic interpreted languages like Python. Building a new symbolic execution engine is a monumental effort, and so is keeping it up-to-date as the target language evolves. Furthermore, ambiguous language specifications lead to their implementation in a symbolic execution engine potentially differing from the production interpreter in subtle ways. We address these challenges by flipping the problem and using the interpreter itself as a specification of the language semantics. We present a recipe and tool (called Chef) for turning a vanilla interpreter into a sound and complete symbolic execution engine. Chef symbolically executes the target program by symbolically executing the interpreter's binary while exploiting inferred knowledge about the program's high-level structure. Using Chef, we developed a symbolic execution engine for Python in 5 person-days and one for Lua in 3 person-days. They offer complete and faithful coverage of language features in a way that keeps up with future language versions at near-zero cost. Chef-produced engines are up to 1000 times more performant than if directly executing the interpreter symbolically without Chef.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115332926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-efficient work-stealing language runtimes","authors":"Haris Ribic, Yu David Liu","doi":"10.1145/2541940.2541971","DOIUrl":"https://doi.org/10.1145/2541940.2541971","url":null,"abstract":"Work stealing is a promising approach to constructing multithreaded program runtimes of parallel programming languages. This paper presents HERMES, an energy-efficient work-stealing language runtime. The key insight is that threads in a work-stealing environment -- thieves and victims - have varying impacts on the overall program running time, and a coordination of their execution \"tempo\" can lead to energy efficiency with minimal performance loss. The centerpiece of HERMES is two complementary algorithms to coordinate thread tempo: the workpath-sensitive algorithm determines tempo for each thread based on thief-victim relationships on the execution path, whereas the workload-sensitive algorithm selects appropriate tempo based on the size of work-stealing deques. We construct HERMES on top of Intel Cilk Plus's runtime, and implement tempo adjustment through standard Dynamic Voltage and Frequency Scaling (DVFS). Benchmarks running on HERMES demonstrate an average of 11-12% energy savings with an average of 3-4% performance loss through meter-based measurements over commercial CPUs.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115459494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Q100: the architecture and design of a database processing unit
Lisa Wu, A. Lottarini, Timothy K. Paine, Martha A. Kim, K. A. Ross. In ASPLOS '14, 2014. DOI: 10.1145/2541940.2541961

Abstract: In this paper, we propose Database Processing Units, or DPUs, a class of domain-specific database processors that can efficiently handle database applications. As a proof of concept, we present the instruction set architecture, microarchitecture, and hardware implementation of one DPU, called Q100. The Q100 has a collection of heterogeneous ASIC tiles that process relational tables and columns quickly and energy-efficiently. The architecture uses coarse-grained instructions that manipulate streams of data, thereby maximizing pipeline and data parallelism and minimizing the need to time-multiplex the accelerator tiles and spill intermediate results to memory. This work explores a Q100 design space of 150 configurations, selecting three for further analysis: a small, power-conscious implementation; a high-performance implementation; and a balanced design that maximizes performance per watt. We then demonstrate that the power-conscious Q100 handles the TPC-H queries with three orders of magnitude less energy than a state-of-the-art software DBMS, while the performance-oriented design outperforms the same DBMS by 70x.
{"title":"REF: resource elasticity fairness with sharing incentives for multiprocessors","authors":"S. Zahedi, Benjamin C. Lee","doi":"10.1145/2541940.2541962","DOIUrl":"https://doi.org/10.1145/2541940.2541962","url":null,"abstract":"With the democratization of cloud and datacenter computing, users increasingly share large hardware platforms. In this setting, architects encounter two challenges: sharing fairly and sharing multiple resources. Drawing on economic game-theory, we rethink fairness in computer architecture. A fair allocation must provide sharing incentives (SI), envy-freeness (EF), and Pareto efficiency (PE). We show that Cobb-Douglas utility functions are well suited to modeling user preferences for cache capacity and memory bandwidth. And we present an allocation mechanism that uses Cobb-Douglas preferences to determine each user's fair share of the hardware. This mechanism provably guarantees SI, EF, and PE, as well as strategy-proofness in the large (SPL). And it does so with modest performance penalties, less than 10% throughput loss, relative to an unfair mechanism.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122180137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Caches and TLBs","authors":"K. McKinley","doi":"10.1145/3260935","DOIUrl":"https://doi.org/10.1145/3260935","url":null,"abstract":"","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114869511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neuromorphic processing: a new frontier in scaling computer architecture","authors":"Jeff Gehlhaar","doi":"10.1145/2654822.2564710","DOIUrl":"https://doi.org/10.1145/2654822.2564710","url":null,"abstract":"The desire to build a computer that operates in the same manner as our brains is as old as the computer itself. Although computer engineering has made great strides in hardware performance as a result of Dennard scaling, and even great advances in 'brain like' computation, the field still struggles to move beyond sequential, analytical computing architectures. Neuromorphic systems are being developed to transcend the barriers imposed by silicon power consumption, develop new algorithms that help machines achieve cognitive behaviors, and both exploit and enable further research in neuroscience. In this talk I will discuss a system im-plementing spiking neural networks. These systems hold the promise of an architecture that is event based, broad and shallow, and thus more power efficient than conventional computing solu-tions. This new approach to computation based on modeling the brain and its simple but highly connected units presents a host of new challenges. Hardware faces tradeoffs such as density or lower power at the cost of high interconnection overhead. Consequently, software systems must face choices about new language design. Highly distributed hardware systems require complex place and route algorithms to distribute the execution of the neural network across a large number of highly interconnected processing units. Finally, the overall design, simulation and testing process has to be entirely reimagined. We discuss these issues in the context of the Zeroth processor and how this approach compares to other neuromorphic systems that are becoming available.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115900649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Heterogeneous-race-free memory models
Derek Hower, Blake A. Hechtman, Bradford M. Beckmann, Benedict R. Gaster, M. Hill, S. Reinhardt, D. Wood. In ASPLOS '14, 2014. DOI: 10.1145/2541940.2541981

Abstract: Commodity heterogeneous systems (e.g., integrated CPUs and GPUs) now support a unified, shared memory address space for all components. Because the latency of global communication in a heterogeneous system can be prohibitively high, heterogeneous systems (unlike homogeneous CPU systems) provide synchronization mechanisms that only guarantee ordering among a subset of threads, which we call a scope. Unfortunately, the consequences and semantics of these scoped operations are not yet well understood. Without a formal and approachable model to reason about the behavior of these operations, we risk an array of portability and performance issues. In this paper, we embrace scoped synchronization with a new class of memory consistency models that add scoped synchronization to data-race-free models like those of C++ and Java. Called sequential consistency for heterogeneous-race-free (SC for HRF), the new models guarantee SC for programs with "sufficient" synchronization (no data races) of "sufficient" scope. We discuss two such models. The first, HRF-direct, works well for programs with highly regular parallelism. The second, HRF-indirect, builds on HRF-direct by allowing synchronization using different scopes in some cases involving transitive communication. We quantitatively show that HRF-indirect encourages forward-looking programs with irregular parallelism by showing up to a 10% performance increase in a task runtime for GPUs.
NVM duet: unified working memory and persistent store architecture
Ren-Shuo Liu, De-Yu Shen, Chia-Lin Yang, Shun-Chih Yu, Cheng-Yuan Michael Wang. In ASPLOS '14, 2014. DOI: 10.1145/2541940.2541957

Abstract: Emerging non-volatile memory (NVM) technologies have attracted considerable attention recently. The byte-addressability and high density of NVM enable computer architects to build large-scale main memory systems. NVM has also been shown to be a promising alternative to conventional persistent stores: with NVM, programmers can persistently retain in-memory data structures without writing them to disk. One can therefore envision that, in the future, NVM will play the role of both working memory and persistent store at the same time. Persistent stores demand consistency and durability guarantees, thereby imposing new design constraints on the memory system. Consistency is achieved at the expense of serializing multiple write operations. Durability requires memory cells to guarantee non-volatility and thus reduces the write speed. A unified architecture oblivious to these two use cases would therefore lead to a suboptimal design. In this paper, we propose NVM Duet, a novel unified working-memory and persistent-store architecture that provides the required consistency and durability guarantees for the persistent store while relaxing these constraints when accesses to NVM are for working memory. A cross-layer design approach is adopted to achieve this goal. Overall, simulation results demonstrate that NVM Duet achieves up to 1.68x (1.32x on average) speedup compared with the baseline design.
{"title":"Challenging the \"embarrassingly sequential\": parallelizing finite state machine-based computations through principled speculation","authors":"Zhijia Zhao, Bo Wu, Xipeng Shen","doi":"10.1145/2541940.2541989","DOIUrl":"https://doi.org/10.1145/2541940.2541989","url":null,"abstract":"Finite-State Machine (FSM) applications are important for many domains. But FSM computation is inherently sequential, making such applications notoriously difficult to parallelize. Most prior methods address the problem through speculations on simple heuristics, offering limited applicability and inconsistent speedups. This paper provides some principled understanding of FSM parallelization, and offers the first disciplined way to exploit application-specific information to inform speculations for parallelization. Through a series of rigorous analysis, it presents a probabilistic model that captures the relations between speculative executions and the properties of the target FSM and its inputs. With the formulation, it proposes two model-based speculation schemes that automatically customize themselves with the suitable configurations to maximize the parallelization benefits. This rigorous treatment yields near-linear speedup on applications that state-of-the-art techniques can barely accelerate.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"254 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116880367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}