2016 International Conference on Parallel Architecture and Compilation Techniques (PACT): Latest Publications

Student research poster: Network controller emulation on a sidecore for unmodified virtual machines
Arthur Kiyanovski
{"title":"Student research poster: Network controller emulation on a sidecore for unmodified virtual machines","authors":"Arthur Kiyanovski","doi":"10.1145/2967938.2971469","DOIUrl":"https://doi.org/10.1145/2967938.2971469","url":null,"abstract":"Paravirtual I/O devices are known to outperform emulated I/O devices but this performance improvement comes with two major drawbacks: Guest machine owners must install hypervisor-specific device drivers every time they switch hypervisors, and these device drivers must be implemented by the hypervisor providers for all major operating systems. Emulated devices do not suffer from these drawbacks because their drivers are implemented by the manufacturers of the bare-metal devices, and come preinstalled. We used optimizations from the virtio-net paravirtual network device combined with a sidecore to improve emulation of the E1000 network device in the QEMU hypervisor. Initial results show that the performance gap between emulated and paravirtual I/O devices is smaller than was previously thought. The small performance difference between paravirtual and emulated devices, along with the aforementioned advantages of the latter, makes emulation a natural choice when flexibility takes precedence over performance.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132585516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
POSTER: An optimization of dataflow architectures for scientific applications
Xiaowei Shen, Xiaochun Ye, Xu Tan, Da Wang, Zhimin Zhang, Dongrui Fan, Zhimin Tang
{"title":"POSTER: An optimization of dataflow architectures for scientific applications","authors":"Xiaowei Shen, Xiaochun Ye, Xu Tan, Da Wang, Zhimin Zhang, Dongrui Fan, Zhimin Tang","doi":"10.1145/2967938.2974054","DOIUrl":"https://doi.org/10.1145/2967938.2974054","url":null,"abstract":"Dataflow computing is proved to be promising in high-performance computing. However, traditional dataflow architectures are general-purpose and not efficient enough when dealing with typical scientific applications due to low utilization of function units. In this paper, we propose an optimization of dataflow architectures for scientific applications. The optimization introduces a request for operands mechanism and a topology-based instruction mapping algorithm to improve the efficiency of dataflow architectures. Experimental results show that the request for operands optimization achieves a 4.6% average performance improvement over the traditional dataflow architectures and the TBIM algorithm achieves a 2.28× and a 1.98× average performance improvement over SPDI and SPS algorithm respectively.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132169851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Accelerating linked-list traversal through near-data processing
B. Hong, Gwangsun Kim, Jung Ho Ahn, Yongkee Kwon, Hongsik Kim, John Kim
{"title":"Accelerating linked-list traversal through near-data processing","authors":"B. Hong, Gwangsun Kim, Jung Ho Ahn, Yongkee Kwon, Hongsik Kim, John Kim","doi":"10.1145/2967938.2967958","DOIUrl":"https://doi.org/10.1145/2967938.2967958","url":null,"abstract":"Recent technology advances in memory system design, along with 3D stacking, have made near-data processing (NDP) more feasible to accelerate different workloads. In this work, we explore near-data processing for a fundamental operation - linked-list traversal (LLT). We propose a new NDP architecture that does not change the existing sequential programming model and does not require any modification to the processor microarchitecture. Instead, we exploit the packetized interface between the core and the memory modules to off-load LLT for NDP. We leverage a system with multiple memory modules (e.g., hybrid memory cube (HMC) modules) interconnected with a memory network and our initial evaluation shows that simply off-loading LLT computation to near-memory can actually reduce performance because of the additional off-chip memory network channel traversals. Thus, we first propose NDP-aware data localization to exploit locality - including locality within a single memory module and memory vault - to minimize latency and improve energy efficiency. In order to improve overall throughput and maximize parallelism, we propose batching multiple LLT operations together to amortize the cost of NDP by utilizing the highly parallel execution of NDP processing units and the high bandwidth of 3D stacked DRAM. The combination of NDP-aware data localization and batching can provide significant improvement in performance and energy efficiency compared to host-processing.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"11 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130280756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 34
A DSL compiler for accelerating image processing pipelines on FPGAs
Nitin Chugh, Vinay Vasista, Suresh Purini, Uday Bondhugula
{"title":"A DSL compiler for accelerating image processing pipelines on FPGAs","authors":"Nitin Chugh, Vinay Vasista, Suresh Purini, Uday Bondhugula","doi":"10.1145/2967938.2967969","DOIUrl":"https://doi.org/10.1145/2967938.2967969","url":null,"abstract":"This paper describes an automatic approach to accelerate image processing pipelines using FPGAs. An image processing pipeline can be viewed as a graph of interconnected stages that processes images successively. Each stage typically performs a point-wise, stencil, or other more complex operations on image pixels. Recent efforts have led to the development of domain-specific languages (DSL) and optimization frameworks for image processing pipelines. In this paper, we develop an approach to map image processing pipelines expressed in the PolyMage DSL to efficient parallel FPGA designs. Our approach exploits reuse and available memory bandwidth (or chip resources) maximally. When compared to Darkroom, a state-of-the-art approach to compile high-level DSL to FPGAs, our approach (a) leads to designs that deliver significantly higher throughput, and (b) supports a greater variety of filters. Furthermore, the designs we generate obtain an improvement even over pre-optimized FPGA implementations provided by vendor libraries for some of the benchmarks.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"164 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122470621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 50
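The abstract above describes an image processing pipeline as a graph of stages, each applying a point-wise or stencil operation to pixels. The sketch below illustrates only that general idea in plain Python, using a linear chain of two stages. It is not PolyMage syntax and not the authors' FPGA mapping; the names `pointwise`, `blur_rows`, and `run_pipeline` are hypothetical and chosen for illustration.

```python
from typing import Callable, List

Image = List[List[float]]
Stage = Callable[[Image], Image]


def pointwise(img: Image, f: Callable[[float], float]) -> Image:
    """Point-wise stage: each output pixel depends on one input pixel."""
    return [[f(p) for p in row] for row in img]


def blur_rows(img: Image) -> Image:
    """Stencil stage: 3-tap horizontal mean; border pixels are copied as-is."""
    out = []
    for row in img:
        new_row = list(row)
        for x in range(1, len(row) - 1):
            new_row[x] = (row[x - 1] + row[x] + row[x + 1]) / 3
        out.append(new_row)
    return out


def run_pipeline(img: Image, stages: List[Stage]) -> Image:
    """Apply the stages one after another (a linear pipeline graph)."""
    for stage in stages:
        img = stage(img)
    return img


# Double every pixel, then blur each row.
result = run_pipeline(
    [[0, 3, 6, 9]],
    [lambda im: pointwise(im, lambda p: 2 * p), blur_rows],
)
print(result)
```

A stencil stage reads a neighbourhood of each pixel, which is why reuse of nearby input values matters when such a pipeline is mapped to hardware.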
Reduction drawing: Language constructs and polyhedral compilation for reductions on GPUs
Chandan Reddy, Michael Kruse, Albert Cohen
{"title":"Reduction drawing: Language constructs and polyhedral compilation for reductions on GPUs","authors":"Chandan Reddy, Michael Kruse, Albert Cohen","doi":"10.1145/2967938.2967950","DOIUrl":"https://doi.org/10.1145/2967938.2967950","url":null,"abstract":"Reductions are common in scientific and data-crunching codes, and a typical source of bottlenecks on massively parallel architectures such as GPUs. Reductions are memory-bound, and achieving peak performance involves sophisticated optimizations. There exist libraries such as CUB and Thrust providing highly tuned implementations of reductions on GPUs. However, library APIs are not flexible enough to express user-defined reductions on arbitrary data types and array indexing schemes. Languages such as OpenACC provide declarative syntax to express reductions. Such approaches support a limited range of reduction operators and do not facilitate the application of complex program transformations in presence of reductions. We present language constructs that let a programmer express arbitrary reductions on user-defined data types matching the performance of tuned library implementations. We also extend a polyhedral compilation flow to process these user-defined reductions, enabling optimizations such as the fusion of multiple reductions, combining reductions with other loop transformations, and optimizing data transfers and storage in the presence of reductions. We implemented these language constructs and compilation methods in the PPCG framework and conducted experiments on multiple GPU targets. For single reductions the generated code performs on par with highly tuned libraries, and for multiple reductions it significantly outperforms both libraries and OpenACC on all platforms.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129387739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
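The abstract above mentions user-defined reductions on arbitrary data types and the fusion of multiple reductions. The sketch below shows those two ideas on the CPU in plain Python: a pairwise (tree-shaped) reduction that accepts any associative operator, and a single operator that computes a sum and a maximum in one pass. It does not reproduce the paper's language constructs or the PPCG code generator; `tree_reduce` and `fused_sum_max` are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def tree_reduce(op: Callable[[T, T], T], items: Sequence[T]) -> T:
    """Pairwise reduction; op must be associative (order is preserved)."""
    if not items:
        raise ValueError("cannot reduce an empty sequence")
    level = list(items)
    while len(level) > 1:
        # Combine neighbours; on a GPU each pair could go to its own thread.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
    return level[0]


def fused_sum_max(a: Tuple[int, int], b: Tuple[int, int]) -> Tuple[int, int]:
    """User-defined operator on (sum, max) pairs: two reductions in one pass."""
    return (a[0] + b[0], max(a[1], b[1]))


data = [3, 1, 4, 1, 5, 9, 2, 6]
total, largest = tree_reduce(fused_sum_max, [(x, x) for x in data])
print(total, largest)
```

Fusing the two reductions means the input is read once instead of twice, which matters for a memory-bound operation.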
Scaling data analytics with moore's law
K. Olukotun
{"title":"Scaling data analytics with moore's law","authors":"K. Olukotun","doi":"10.1145/2967938.2970375","DOIUrl":"https://doi.org/10.1145/2967938.2970375","url":null,"abstract":"Analyzing the volume, variety and velocity of big data requires the use of modern heterogeneous computing platforms composed of multicores with SIMD execution units, GPUs, clusters, FPGAs and in the future new reconfigurable architectures. However, programming in this environment is extremely challenging due to the need to use multiple low-level programming models and then combine them together in ad-hoc ways. Furthermore, many data analytics algorithms do not take full advantage of modern hardware capabilities. To optimize big data applications both for modern hardware and for modern programmers needs algorithms specialized for modern hardware and a high-level programming model that executes efficiently on heterogeneous parallel hardware. In this talk, I will describe the Delite DSL framework, which uses nested parallel patterns encapsulated in domain specific languages (DSLs). I will describe how a nested parallel pattern based programming model can be used to develop new data analytics algorithms that are optimized for architectures as diverse as multicore/NUMA, clusters, GPUs, FPGAs and a new reconfigurable architecture called Plasticine.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124973391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
WearCore: A core for wearable workloads?
Sanyam Mehta, J. Torrellas
{"title":"WearCore: A core for wearable workloads?","authors":"Sanyam Mehta, J. Torrellas","doi":"10.1145/2967938.2967956","DOIUrl":"https://doi.org/10.1145/2967938.2967956","url":null,"abstract":"Lately, the industry has recognized immense potential in wearables (particularly, smartwatches) being an attractive alternative/supplement to the smartphone. To this end, there has been recent activity in making the smartwatch `self-sufficient' i.e. using it to make/receive calls, etc. independently of the phone. This marked shift in the way wearables will be used in future calls for changes in the core micro-architecture of smartwatch processors. In this work, we first identify ten key target applications for the smartwatch users that the processor must be able to quickly and efficiently execute. We show that seven of these workloads are inherently parallel, and are compute- and data-intensive. We therefore propose to use a multi-core processor with simple out-of-order cores (for compute performance) and augment them with a light-weight software-assisted hardware prefetcher (for memory performance). This simple core with the light-weight prefetcher, called WearCore, is 2.9× more energy-efficient and 2.8× more area-efficient over an in-order core. The improvements are similar with respect to an out-of-order core.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115172415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Hybrid data dependence analysis for loop transformations
Diogo Sampaio, A. Ketterlin, L. Pouchet, F. Rastello
{"title":"Hybrid data dependence analysis for loop transformations","authors":"Diogo Sampaio, A. Ketterlin, L. Pouchet, F. Rastello","doi":"10.1145/2967938.2974059","DOIUrl":"https://doi.org/10.1145/2967938.2974059","url":null,"abstract":"Loop optimizations span from vectorization, scalar promotion, loop invariant code motion, software pipelining to loop fusion, skewing, tiling [6, Ch.9] and loop parallelization. These transformations are essential in the quest for automated high-performance code generation.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121443458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Big data analytics on flash storage with accelerators
Arvind
{"title":"Big data analytics on flash storage with accelerators","authors":"Arvind","doi":"10.1145/2967938.2970374","DOIUrl":"https://doi.org/10.1145/2967938.2970374","url":null,"abstract":"Complex analytics of the vast amount of data collected via social media, cell phones, ubiquitous smart sensors, and satellites is likely to be the biggest economic driver for the IT industry over the next decade. For many “Big Data” applications, the limiting factor in performance is often the transportation of large amount of data from hard disks to where it can be processed, i.e. DRAM. We will present BlueDBM, an architecture for a scalable distributed flash store which overcomes this limitation by providing a high-performance, high-capacity, scalable random-access flash storage, and by allowing computation near the data via a FPGA-based programmable flash controller. We will present the preliminary results for two applications, (1) key-value store (KVS) and (2) sparse-matrix accelerator for graph processing, on BlueDBM consisting of 20 nodes and 20TB of flash.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121544577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Vectorization of multibyte floating point data formats
Andrew Anderson, David Gregg
{"title":"Vectorization of multibyte floating point data formats","authors":"Andrew Anderson, David Gregg","doi":"10.1145/2967938.2967966","DOIUrl":"https://doi.org/10.1145/2967938.2967966","url":null,"abstract":"We propose a scheme for reduced-precision representation of floating point data on a continuum between IEEE-754 floating point types. Our scheme enables the use of lower precision formats for a reduction in storage space requirements and data transfer volume. We describe how our scheme can be accelerated using existing hardware vector units on a general-purpose processor (GPP). Exploiting native vector hardware allows us to support reduced precision floating point with low overhead. We demonstrate that supporting reduced precision in the compiler as opposed to using a library approach can yield a low overhead solution for GPPs.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129954112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
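The abstract above describes reduced-precision floating point formats that sit between the IEEE-754 types and save storage and transfer volume. One simple way to get such a format, assumed here purely for illustration, is to keep only the most significant bytes of a 64-bit IEEE-754 value and zero-fill the rest when reading it back. The sketch below does this in scalar Python; it may differ from the authors' scheme and does not show the vectorized implementation. The names `pack_truncated` and `unpack_truncated` are hypothetical.

```python
import struct


def pack_truncated(value: float, nbytes: int) -> bytes:
    """Keep the nbytes most significant bytes of the big-endian float64."""
    if not 2 <= nbytes <= 8:
        raise ValueError("nbytes must be between 2 and 8")
    return struct.pack(">d", value)[:nbytes]


def unpack_truncated(data: bytes) -> float:
    """Zero-fill the dropped low-order mantissa bytes and decode as float64."""
    if not 2 <= len(data) <= 8:
        raise ValueError("expected between 2 and 8 bytes")
    return struct.unpack(">d", data + b"\x00" * (8 - len(data)))[0]


# A 5-byte (40-bit) encoding keeps sign, 11 exponent bits, 28 mantissa bits.
stored = pack_truncated(0.1, 5)
restored = unpack_truncated(stored)
print(len(stored), restored)
```

Truncation rounds toward zero, so the relative error of a normal value stored in 5 bytes stays below 2^-28; rounding to nearest before truncating would halve that bound at the cost of extra work.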