{"title":"Student research poster: Network controller emulation on a sidecore for unmodified virtual machines","authors":"Arthur Kiyanovski","doi":"10.1145/2967938.2971469","DOIUrl":"https://doi.org/10.1145/2967938.2971469","url":null,"abstract":"Paravirtual I/O devices are known to outperform emulated I/O devices but this performance improvement comes with two major drawbacks: Guest machine owners must install hypervisor-specific device drivers every time they switch hypervisors, and these device drivers must be implemented by the hypervisor providers for all major operating systems. Emulated devices do not suffer from these drawbacks because their drivers are implemented by the manufacturers of the bare-metal devices, and come preinstalled. We used optimizations from the virtio-net paravirtual network device combined with a sidecore to improve emulation of the E1000 network device in the QEMU hypervisor. Initial results show that the performance gap between emulated and paravirtual I/O devices is smaller than was previously thought. The small performance difference between paravirtual and emulated devices, along with the aforementioned advantages of the latter, makes emulation a natural choice when flexibility takes precedence over performance.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132585516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaowei Shen, Xiaochun Ye, Xu Tan, Da Wang, Zhimin Zhang, Dongrui Fan, Zhimin Tang
{"title":"POSTER: An optimization of dataflow architectures for scientific applications","authors":"Xiaowei Shen, Xiaochun Ye, Xu Tan, Da Wang, Zhimin Zhang, Dongrui Fan, Zhimin Tang","doi":"10.1145/2967938.2974054","DOIUrl":"https://doi.org/10.1145/2967938.2974054","url":null,"abstract":"Dataflow computing is proved to be promising in high-performance computing. However, traditional dataflow architectures are general-purpose and not efficient enough when dealing with typical scientific applications due to low utilization of function units. In this paper, we propose an optimization of dataflow architectures for scientific applications. The optimization introduces a request for operands mechanism and a topology-based instruction mapping algorithm to improve the efficiency of dataflow architectures. Experimental results show that the request for operands optimization achieves a 4.6% average performance improvement over the traditional dataflow architectures and the TBIM algorithm achieves a 2.28× and a 1.98× average performance improvement over SPDI and SPS algorithm respectively.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132169851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
B. Hong, Gwangsun Kim, Jung Ho Ahn, Yongkee Kwon, Hongsik Kim, John Kim
{"title":"Accelerating linked-list traversal through near-data processing","authors":"B. Hong, Gwangsun Kim, Jung Ho Ahn, Yongkee Kwon, Hongsik Kim, John Kim","doi":"10.1145/2967938.2967958","DOIUrl":"https://doi.org/10.1145/2967938.2967958","url":null,"abstract":"Recent technology advances in memory system design, along with 3D stacking, have made near-data processing (NDP) more feasible to accelerate different workloads. In this work, we explore near-data processing for a fundamental operation - linked-list traversal (LLT). We propose a new NDP architecture that does not change the existing sequential programming model and does not require any modification to the processor microarchitecture. Instead, we exploit the packetized interface between the core and the memory modules to off-load LLT for NDP. We leverage a system with multiple memory modules (e.g., hybrid memory cube (HMC) modules) interconnected with a memory network and our initial evaluation shows that simply off-loading LLT computation to near-memory can actually reduce performance because of the additional off-chip memory network channel traversals. Thus, we first propose NDP-aware data localization to exploit locality - including locality within a single memory module and memory vault - to minimize latency and improve energy efficiency. In order to improve overall throughput and maximize parallelism, we propose batching multiple LLT operations together to amortize the cost of NDP by utilizing the highly parallel execution of NDP processing units and the high bandwidth of 3D stacked DRAM. The combination of NDP-aware data localization and batching can provide significant improvement in performance and energy efficiency compared to host-processing.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"11 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130280756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A DSL compiler for accelerating image processing pipelines on FPGAs","authors":"Nitin Chugh, Vinay Vasista, Suresh Purini, Uday Bondhugula","doi":"10.1145/2967938.2967969","DOIUrl":"https://doi.org/10.1145/2967938.2967969","url":null,"abstract":"This paper describes an automatic approach to accelerate image processing pipelines using FPGAs. An image processing pipeline can be viewed as a graph of interconnected stages that processes images successively. Each stage typically performs a point-wise, stencil, or other more complex operations on image pixels. Recent efforts have led to the development of domain-specific languages (DSL) and optimization frameworks for image processing pipelines. In this paper, we develop an approach to map image processing pipelines expressed in the PolyMage DSL to efficient parallel FPGA designs. Our approach exploits reuse and available memory bandwidth (or chip resources) maximally. When compared to Darkroom, a state-of-the-art approach to compile high-level DSL to FPGAs, our approach (a) leads to designs that deliver significantly higher throughput, and (b) supports a greater variety of filters. Furthermore, the designs we generate obtain an improvement even over pre-optimized FPGA implementations provided by vendor libraries for some of the benchmarks.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"164 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122470621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reduction drawing: Language constructs and polyhedral compilation for reductions on GPUs","authors":"Chandan Reddy, Michael Kruse, Albert Cohen","doi":"10.1145/2967938.2967950","DOIUrl":"https://doi.org/10.1145/2967938.2967950","url":null,"abstract":"Reductions are common in scientific and data-crunching codes, and a typical source of bottlenecks on massively parallel architectures such as GPUs. Reductions are memory-bound, and achieving peak performance involves sophisticated optimizations. There exist libraries such as CUB and Thrust providing highly tuned implementations of reductions on GPUs. However, library APIs are not flexible enough to express user-defined reductions on arbitrary data types and array indexing schemes. Languages such as OpenACC provide declarative syntax to express reductions. Such approaches support a limited range of reduction operators and do not facilitate the application of complex program transformations in presence of reductions. We present language constructs that let a programmer express arbitrary reductions on user-defined data types matching the performance of tuned library implementations. We also extend a polyhedral compilation flow to process these user-defined reductions, enabling optimizations such as the fusion of multiple reductions, combining reductions with other loop transformations, and optimizing data transfers and storage in the presence of reductions. We implemented these language constructs and compilation methods in the PPCG framework and conducted experiments on multiple GPU targets. For single reductions the generated code performs on par with highly tuned libraries, and for multiple reductions it significantly outperforms both libraries and OpenACC on all platforms.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129387739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scaling data analytics with moore's law","authors":"K. Olukotun","doi":"10.1145/2967938.2970375","DOIUrl":"https://doi.org/10.1145/2967938.2970375","url":null,"abstract":"Analyzing the volume, variety and velocity of big data requires the use of modern heterogeneous computing platforms composed of multicores with SIMD execution units, GPUs, clusters, FPGAs and in the future new reconfigurable architectures. However, programming in this environment is extremely challenging due to the need to use multiple low-level programming models and then combine them together in ad-hoc ways. Furthermore, many data analytics algorithms do not take full advantage of modern hardware capabilities. To optimize big data applications both for modern hardware and for modern programmers needs algorithms specialized for modern hardware and a high-level programming model that executes efficiently on heterogeneous parallel hardware. In this talk, I will describe the Delite DSL framework, which uses nested parallel patterns encapsulated in domain specific languages (DSLs). I will describe how a nested parallel pattern based programming model can be used to develop new data analytics algorithms that are optimized for architectures as diverse as multicore/NUMA, clusters, GPUs, FPGAs and a new reconfigurable architecture called Plasticine.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124973391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"WearCore: A core for wearable workloads?","authors":"Sanyam Mehta, J. Torrellas","doi":"10.1145/2967938.2967956","DOIUrl":"https://doi.org/10.1145/2967938.2967956","url":null,"abstract":"Lately, the industry has recognized immense potential in wearables (particularly, smartwatches) being an attractive alternative/supplement to the smartphone. To this end, there has been recent activity in making the smartwatch `self-sufficient' i.e. using it to make/receive calls, etc. independently of the phone. This marked shift in the way wearables will be used in future calls for changes in the core micro-architecture of smartwatch processors. In this work, we first identify ten key target applications for the smartwatch users that the processor must be able to quickly and efficiently execute. We show that seven of these workloads are inherently parallel, and are compute- and data-intensive. We therefore propose to use a multi-core processor with simple out-of-order cores (for compute performance) and augment them with a light-weight software-assisted hardware prefetcher (for memory performance). This simple core with the light-weight prefetcher, called WearCore, is 2.9× more energy-efficient and 2.8× more area-efficient over an in-order core. The improvements are similar with respect to an out-of-order core.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115172415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diogo Sampaio, A. Ketterlin, L. Pouchet, F. Rastello
{"title":"Hybrid data dependence analysis for loop transformations","authors":"Diogo Sampaio, A. Ketterlin, L. Pouchet, F. Rastello","doi":"10.1145/2967938.2974059","DOIUrl":"https://doi.org/10.1145/2967938.2974059","url":null,"abstract":"Loop optimizations span from vectorization, scalar promotion, loop invariant code motion, software pipelining to loop fusion, skewing, tiling [6, Ch.9] and loop parallelization. These transformations are essential in the quest for automated high-performance code generation.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121443458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Big data analytics on flash storage with accelerators","authors":"Arvind","doi":"10.1145/2967938.2970374","DOIUrl":"https://doi.org/10.1145/2967938.2970374","url":null,"abstract":"Complex analytics of the vast amount of data collected via social media, cell phones, ubiquitous smart sensors, and satellites is likely to be the biggest economic driver for the IT industry over the next decade. For many “Big Data” applications, the limiting factor in performance is often the transportation of large amount of data from hard disks to where it can be processed, i.e. DRAM. We will present BlueDBM, an architecture for a scalable distributed flash store which overcomes this limitation by providing a high-performance, high-capacity, scalable random-access flash storage, and by allowing computation near the data via a FPGA-based programmable flash controller. We will present the preliminary results for two applications, (1) key-value store (KVS) and (2) sparse-matrix accelerator for graph processing, on BlueDBM consisting of 20 nodes and 20TB of flash.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121544577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Vectorization of multibyte floating point data formats","authors":"Andrew Anderson, David Gregg","doi":"10.1145/2967938.2967966","DOIUrl":"https://doi.org/10.1145/2967938.2967966","url":null,"abstract":"We propose a scheme for reduced-precision representation of floating point data on a continuum between IEEE-754 floating point types. Our scheme enables the use of lower precision formats for a reduction in storage space requirements and data transfer volume. We describe how our scheme can be accelerated using existing hardware vector units on a general-purpose processor (GPP). Exploiting native vector hardware allows us to support reduced precision floating point with low overhead. We demonstrate that supporting reduced precision in the compiler as opposed to using a library approach can yield a low overhead solution for GPPs.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129954112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}