Proceedings of the 19th international conference on Architectural support for programming languages and operating systems最新文献

筛选
英文 中文
DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning DianNao:用于无处不在的机器学习的小足迹高吞吐量加速器
Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, O. Temam
{"title":"DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning","authors":"Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, O. Temam","doi":"10.1145/2541940.2541967","DOIUrl":"https://doi.org/10.1145/2541940.2541967","url":null,"abstract":"Machine-Learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neurons outputs additions) in a small footprint of 3.02 mm2 and 485 mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65 nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125292750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1435
Prototyping symbolic execution engines for interpreted languages 用于解释语言的符号执行引擎原型
Stefan Bucur, Johannes Kinder, George Candea
{"title":"Prototyping symbolic execution engines for interpreted languages","authors":"Stefan Bucur, Johannes Kinder, George Candea","doi":"10.1145/2541940.2541977","DOIUrl":"https://doi.org/10.1145/2541940.2541977","url":null,"abstract":"Symbolic execution is being successfully used to automatically test statically compiled code. However, increasingly more systems and applications are written in dynamic interpreted languages like Python. Building a new symbolic execution engine is a monumental effort, and so is keeping it up-to-date as the target language evolves. Furthermore, ambiguous language specifications lead to their implementation in a symbolic execution engine potentially differing from the production interpreter in subtle ways. We address these challenges by flipping the problem and using the interpreter itself as a specification of the language semantics. We present a recipe and tool (called Chef) for turning a vanilla interpreter into a sound and complete symbolic execution engine. Chef symbolically executes the target program by symbolically executing the interpreter's binary while exploiting inferred knowledge about the program's high-level structure. Using Chef, we developed a symbolic execution engine for Python in 5 person-days and one for Lua in 3 person-days. They offer complete and faithful coverage of language features in a way that keeps up with future language versions at near-zero cost. Chef-produced engines are up to 1000 times more performant than if directly executing the interpreter symbolically without Chef.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115332926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 41
Energy-efficient work-stealing language runtimes 高效的窃取工作的语言运行时
Haris Ribic, Yu David Liu
{"title":"Energy-efficient work-stealing language runtimes","authors":"Haris Ribic, Yu David Liu","doi":"10.1145/2541940.2541971","DOIUrl":"https://doi.org/10.1145/2541940.2541971","url":null,"abstract":"Work stealing is a promising approach to constructing multithreaded program runtimes of parallel programming languages. This paper presents HERMES, an energy-efficient work-stealing language runtime. The key insight is that threads in a work-stealing environment -- thieves and victims - have varying impacts on the overall program running time, and a coordination of their execution \"tempo\" can lead to energy efficiency with minimal performance loss. The centerpiece of HERMES is two complementary algorithms to coordinate thread tempo: the workpath-sensitive algorithm determines tempo for each thread based on thief-victim relationships on the execution path, whereas the workload-sensitive algorithm selects appropriate tempo based on the size of work-stealing deques. We construct HERMES on top of Intel Cilk Plus's runtime, and implement tempo adjustment through standard Dynamic Voltage and Frequency Scaling (DVFS). Benchmarks running on HERMES demonstrate an average of 11-12% energy savings with an average of 3-4% performance loss through meter-based measurements over commercial CPUs.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115459494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 55
Q100: the architecture and design of a database processing unit Q100:数据库处理单元的体系结构和设计
Lisa Wu, A. Lottarini, Timothy K. Paine, Martha A. Kim, K. A. Ross
{"title":"Q100: the architecture and design of a database processing unit","authors":"Lisa Wu, A. Lottarini, Timothy K. Paine, Martha A. Kim, K. A. Ross","doi":"10.1145/2541940.2541961","DOIUrl":"https://doi.org/10.1145/2541940.2541961","url":null,"abstract":"In this paper, we propose Database Processing Units, or DPUs, a class of domain-specific database processors that can efficiently handle database applications. As a proof of concept, we present the instruction set architecture, microarchitecture, and hardware implementation of one DPU, called Q100. The Q100 has a collection of heterogeneous ASIC tiles that process relational tables and columns quickly and energy-efficiently. The architecture uses coarse grained in- structions that manipulate streams of data, thereby maximizing pipeline and data parallelism, and minimizing the need to time multiplex the accelerator tiles and spill inter- mediate results to memory. This work explores a Q100 de- sign space of 150 configurations, selecting three for further analysis: a small, power-conscious implementation, a high- performance implementation, and a balanced design that maximizes performance per Watt. We then demonstrate that the power-conscious Q100 handles the TPC-H queries with three orders of magnitude less energy than a state of the art software DBMS, while the performance-oriented design out- performs the same DBMS by 70X.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"479 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123036056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 171
REF: resource elasticity fairness with sharing incentives for multiprocessors 参考:多处理器共享激励下的资源弹性公平
S. Zahedi, Benjamin C. Lee
{"title":"REF: resource elasticity fairness with sharing incentives for multiprocessors","authors":"S. Zahedi, Benjamin C. Lee","doi":"10.1145/2541940.2541962","DOIUrl":"https://doi.org/10.1145/2541940.2541962","url":null,"abstract":"With the democratization of cloud and datacenter computing, users increasingly share large hardware platforms. In this setting, architects encounter two challenges: sharing fairly and sharing multiple resources. Drawing on economic game-theory, we rethink fairness in computer architecture. A fair allocation must provide sharing incentives (SI), envy-freeness (EF), and Pareto efficiency (PE). We show that Cobb-Douglas utility functions are well suited to modeling user preferences for cache capacity and memory bandwidth. And we present an allocation mechanism that uses Cobb-Douglas preferences to determine each user's fair share of the hardware. This mechanism provably guarantees SI, EF, and PE, as well as strategy-proofness in the large (SPL). And it does so with modest performance penalties, less than 10% throughput loss, relative to an unfair mechanism.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122180137","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 96
Session details: Caches and TLBs 会话细节:缓存和tlb
K. McKinley
{"title":"Session details: Caches and TLBs","authors":"K. McKinley","doi":"10.1145/3260935","DOIUrl":"https://doi.org/10.1145/3260935","url":null,"abstract":"","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114869511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Neuromorphic processing: a new frontier in scaling computer architecture 神经形态处理:缩放计算机体系结构的新前沿
Jeff Gehlhaar
{"title":"Neuromorphic processing: a new frontier in scaling computer architecture","authors":"Jeff Gehlhaar","doi":"10.1145/2654822.2564710","DOIUrl":"https://doi.org/10.1145/2654822.2564710","url":null,"abstract":"The desire to build a computer that operates in the same manner as our brains is as old as the computer itself. Although computer engineering has made great strides in hardware performance as a result of Dennard scaling, and even great advances in 'brain like' computation, the field still struggles to move beyond sequential, analytical computing architectures. Neuromorphic systems are being developed to transcend the barriers imposed by silicon power consumption, develop new algorithms that help machines achieve cognitive behaviors, and both exploit and enable further research in neuroscience. In this talk I will discuss a system im-plementing spiking neural networks. These systems hold the promise of an architecture that is event based, broad and shallow, and thus more power efficient than conventional computing solu-tions. This new approach to computation based on modeling the brain and its simple but highly connected units presents a host of new challenges. Hardware faces tradeoffs such as density or lower power at the cost of high interconnection overhead. Consequently, software systems must face choices about new language design. Highly distributed hardware systems require complex place and route algorithms to distribute the execution of the neural network across a large number of highly interconnected processing units. Finally, the overall design, simulation and testing process has to be entirely reimagined. We discuss these issues in the context of the Zeroth processor and how this approach compares to other neuromorphic systems that are becoming available.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115900649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 23
Heterogeneous-race-free memory models 异构无竞争内存模型
Derek Hower, Blake A. Hechtman, Bradford M. Beckmann, Benedict R. Gaster, M. Hill, S. Reinhardt, D. Wood
{"title":"Heterogeneous-race-free memory models","authors":"Derek Hower, Blake A. Hechtman, Bradford M. Beckmann, Benedict R. Gaster, M. Hill, S. Reinhardt, D. Wood","doi":"10.1145/2541940.2541981","DOIUrl":"https://doi.org/10.1145/2541940.2541981","url":null,"abstract":"Commodity heterogeneous systems (e.g., integrated CPUs and GPUs), now support a unified, shared memory address space for all components. Because the latency of global communication in a heterogeneous system can be prohibi-tively high, heterogeneous systems (unlike homogeneous CPU systems) provide synchronization mechanisms that only guarantee ordering among a subset of threads, which we call a scope. Unfortunately, the consequences and se-mantics of these scoped operations are not yet well under-stood. Without a formal and approachable model to reason about the behavior of these operations, we risk an array of portability and performance issues. In this paper, we embrace scoped synchronization with a new class of memory consistency models that add scoped synchronization to data-race-free models like those of C++ and Java. Called sequential consistency for heterogeneous-race-free (SC for HRF), the new models guarantee SC for programs with \"sufficient\" synchronization (no data races) of \"sufficient\" scope. We discuss two such models. The first, HRF-direct, works well for programs with highly regular parallelism. The second, HRF-indirect, builds on HRF-direct by allowing synchronization using different scopes in some cases involving transitive communication. We quanti-tatively show that HRF-indirect encourages forward-looking programs with irregular parallelism by showing up to a 10% performance increase in a task runtime for GPUs.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130179785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 96
NVM duet: unified working memory and persistent store architecture NVM二重奏:统一的工作内存和持久存储架构
Ren-Shuo Liu, De-Yu Shen, Chia-Lin Yang, Shun-Chih Yu, Cheng-Yuan Michael Wang
{"title":"NVM duet: unified working memory and persistent store architecture","authors":"Ren-Shuo Liu, De-Yu Shen, Chia-Lin Yang, Shun-Chih Yu, Cheng-Yuan Michael Wang","doi":"10.1145/2541940.2541957","DOIUrl":"https://doi.org/10.1145/2541940.2541957","url":null,"abstract":"Emerging non-volatile memory (NVM) technologies have gained a lot of attention recently. The byte-addressability and high density of NVM enable computer architects to build large-scale main memory systems. NVM has also been shown to be a promising alternative to conventional persistent store. With NVM, programmers can persistently retain in-memory data structures without writing them to disk. Therefore, one can envision that in the future, NVM will play the role of both working memory and persistent store at the same time. Persistent store demands consistency and durability guarantees, thereby imposing new design constraints on the memory system. Consistency is achieved at the expense of serializing multiple write operations. Durability requires memory cells to guarantee non-volatility and thus reduces the write speed. Therefore, a unified architecture oblivious to these two use cases would lead to suboptimal design. In this paper, we propose a novel unified working memory and persistent store architecture, NVM Duet, which provides the required consistency and durability guarantees for persistent store while relaxing these constraints if accesses to NVM are for working memory. A cross-layer design approach is adopted to achieve the design goal. Overall, simulation results demonstrate that NVM Duet achieves up to 1.68x (1.32x on average) speedup compared with the baseline design.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131004082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 111
Challenging the "embarrassingly sequential": parallelizing finite state machine-based computations through principled speculation 挑战“令人尴尬的顺序”:通过有原则的推测并行化基于有限状态机的计算
Zhijia Zhao, Bo Wu, Xipeng Shen
{"title":"Challenging the \"embarrassingly sequential\": parallelizing finite state machine-based computations through principled speculation","authors":"Zhijia Zhao, Bo Wu, Xipeng Shen","doi":"10.1145/2541940.2541989","DOIUrl":"https://doi.org/10.1145/2541940.2541989","url":null,"abstract":"Finite-State Machine (FSM) applications are important for many domains. But FSM computation is inherently sequential, making such applications notoriously difficult to parallelize. Most prior methods address the problem through speculations on simple heuristics, offering limited applicability and inconsistent speedups. This paper provides some principled understanding of FSM parallelization, and offers the first disciplined way to exploit application-specific information to inform speculations for parallelization. Through a series of rigorous analysis, it presents a probabilistic model that captures the relations between speculative executions and the properties of the target FSM and its inputs. With the formulation, it proposes two model-based speculation schemes that automatically customize themselves with the suitable configurations to maximize the parallelization benefits. This rigorous treatment yields near-linear speedup on applications that state-of-the-art techniques can barely accelerate.","PeriodicalId":128805,"journal":{"name":"Proceedings of the 19th international conference on Architectural support for programming languages and operating systems","volume":"254 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116880367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 39
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信