{"title":"A Re-Configurable Ray-Triangle Vector Accelerator for Emerging Fog Architectures","authors":"Adrianno Sampaio, A. Sena, A. S. Nery","doi":"10.1109/IPDPSW.2019.00136","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00136","url":null,"abstract":"One of the biggest challenges in computer graphics is to produce photo-realistic images from a three-dimensional scene. On one hand, there are fast ways of rendering an image that often cannot portray the light behavior accurately. On the other hand, the most accurate methods, like the Ray-Tracing algorithm, are very costly regarding computing resources and take a substantial amount of time to render a single frame. Many new techniques have been conceived to accelerate ray-tracing applications while obtaining results close to the desired quality. Moreover, Field-Programmable Gate Arrays (FPGAs) have recently become useful not only to prototype novel systems but also to run specialized parallel accelerators that execute the critical path of a given application. Nonetheless, embedded devices with processing capabilities and internet access generate a substantial increase in network traffic toward distributed systems and cloud services, stimulating the development of Edge/Fog/In-Situ architectures and technologies. Thus, in this work, we present and analyze a Re-configurable Vector Accelerator specified in High-Level Synthesis (HLS) and the concept of a fog system that may use it. The accelerator is specialized in computing ray-triangle intersections and can be used in a distributed rendering environment. It has been implemented in a Xilinx Kintex Ultrascale FPGA (xcku060-ffva1156-2-e) using Xilinx Vivado tools. Experimental performance and energy consumption results show that the accelerator can efficiently render a simplified version of the Stanford Bunny model using different configurations with 1, 2, 4, and 8 Vector Cores.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127099300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simulation Planning Using Component Based Cost Model","authors":"A. Dubey, S. Chawdhary, J. A. Harris, O. E. Bronson Messer","doi":"10.1109/IPDPSW.2019.00116","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00116","url":null,"abstract":"Successful simulations for scientific discovery on high-performance computing platforms require careful planning, including verification of specific application configuration and runtime parameters, estimation of resource requirements, and steering and monitoring of the simulation. However, simulation planning is an aspect of scientific computing that is only sparsely covered in the available literature and training. In this paper, we focus on the resource management aspect of such planning through the formulation of a component-based cost model. We illustrate the methodology and formulation through FLASH, a highly configurable simulation code used in multiple scientific domains.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116014900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"You've Got Mail (YGM): Building Missing Asynchronous Communication Primitives","authors":"Benjamin W. Priest, Trevor Steil, G. Sanders, R. Pearce","doi":"10.1109/IPDPSW.2019.00045","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00045","url":null,"abstract":"The Message Passing Interface (MPI) is the de facto standard for message handling in distributed computing. MPI collective communication schemes, in which many processors communicate with one another, depend upon synchronous handshake agreements. As a result, applications that depend upon iterative collective communications move at the speed of their slowest processors. We describe a methodology for bootstrapping asynchronous communication primitives onto MPI, with an emphasis on the irregular and imbalanced all-to-all communication patterns found in many data analytics applications. In such applications, the communication payload between a pair of processors is often small, requiring message aggregation on modern networks. In this work, we develop novel routing schemes that divide routing logically into local and remote routing. In these schemes, each core on a node is responsible for handling all local node sends and/or receives with a subset of remote cores. Collective communications route messages along their designated intermediaries and are not influenced by the availability of cores not on their route. Unlike conventional synchronous collectives, cores participating in these schemes can enter the protocol when ready and exit once all of their sends and receives are processed. We demonstrate, using simple benchmarks, how this collective communication improves overall wall clock performance, as well as bandwidth and core utilization, for applications with a high demand for arbitrary core-to-core communication and unequal computational load between cores.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124250631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficiently Computing the Power Set in a Parallel Environment","authors":"R. Goodwin","doi":"10.1109/IPDPSW.2019.00100","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00100","url":null,"abstract":"We develop an approach to find the power set of a given set based on creating disjunctive normal form (DNF) clauses and a round-robin load balancing algorithm in a parallel computing environment. Given a problem of size n, the DNF algorithms and the round-robin load balancing compute the entire power set concurrently in O(2^n / ⌊n/2⌋) iterations. This is significantly fewer iterations than the O(2^n) required by the sequential power set algorithms found in computer science textbooks and internet searches. The round-robin load balancing algorithm assigns fewer than ⌊n/2⌋ processors to the power set problem of size n. This paper gives examples of the power set problem for relatively large sets.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114113457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Peachy Parallel Assignments (EduPar 2019)","authors":"O. Ozturk, Ben Glick, Jens Mache, David P. Bunde","doi":"10.1109/IPDPSW.2019.00064","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00064","url":null,"abstract":"Peachy Parallel Assignments are a resource for instructors teaching parallel and distributed programming. These are high-quality assignments, previously tested in class, that are readily adoptable. This collection of assignments includes face recognition, finding the electrical potential of a square wire, and heat diffusion. All of these come with sample assignment sheets and the necessary starter code.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126303124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ArrOW: Experiencing a Parallel Cloud-Based De Novo Assembler Workflow","authors":"Kary A. C. S. Ocaña, Thaylon Guedes, Daniel de Oliveira","doi":"10.1109/IPDPSW.2019.00039","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00039","url":null,"abstract":"Advances in next-generation sequencing technologies have resulted in the generation of an unprecedented volume of sequence data. DNA segments are combined into a reconstruction of the original genome using computer software called genome assemblers. Therefore, assembly now presents new challenges in terms of data management, query, and analysis due to the huge number of read sequences and compute- and memory-intensive algorithms. This restriction reduces the chances of uniformly covering the space for exploring statistics, k-mers, software, or eukaryotic genome assembly. To address these issues, we present ArrOW, a cloud-based de novo Assembly clOud Workflow that explores the potential of provenance analytics and parallel computation provided by scientific workflow management systems such as SciCumulus. We evaluate the overall performance of ArrOW using up to 256 cores in the Amazon AWS cloud. ArrOW achieves improvements of up to 88.3% executing 1,000 reads of genomics datasets. We also highlight how data provenance analytics improved the efficiency of recovering assembly features of genomes.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131972839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Teaching Parallel Computing and Dependence Analysis with Python","authors":"Neftali Watkinson, Aniket Shivam, A. Nicolau, A. Veidenbaum","doi":"10.1109/IPDPSW.2019.00061","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00061","url":null,"abstract":"Languages with a high level of abstraction, such as Python, are becoming popular among programmers and are being adopted as the primary programming language in pedagogy. A potential drawback of using such languages is that architectural aspects, such as data layout in memory, get completely hidden. Therefore, students have difficulty understanding advanced computer science topics such as Parallel Computing. Computer architectures have evolved to allow multiple levels of parallelism. From mobile devices to supercomputers, many tasks are performed in parallel. Parallel programming models have become ubiquitous, and computer science graduates should know how to take advantage of those models. Therefore, it becomes necessary to expose students to the concepts of parallel programming early in the curriculum. This work describes a lesson plan for teaching Parallel Computing, using Data Dependence analysis and Loop transformations, to Python programming students. We analyze our teaching experience, evaluate students' understanding, and assess the likelihood of using parallel programming in introductory courses in the future.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131902828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Decompression of Gzip-Compressed Files and Random Access to DNA Sequences","authors":"Mael Kerbiriou, R. Chikhi","doi":"10.1109/IPDPSW.2019.00042","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00042","url":null,"abstract":"Decompressing a file made by the gzip program at an arbitrary location is in principle impossible, due to the nature of the DEFLATE compression algorithm. Consequently, no existing program can take advantage of parallelism to rapidly decompress large gzip-compressed files. This is an unsatisfactory bottleneck, especially for the analysis of large sequencing data experiments. Here we propose a parallel algorithm and an implementation, pugz, that performs fast and exact decompression of any text file. We show that pugz is an order of magnitude faster than gunzip, and 5x faster than a highly-optimized sequential implementation (libdeflate). We also study the related problem of random access to compressed data. We give simple models and experimental results that shed light on the structure of gzip-compressed files containing DNA sequences. Preliminary results show that random access to sequences within a gzip-compressed FASTQ file is almost always feasible at low compression levels, yet is approximate at higher compression levels.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"322 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115566773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Delta-Stepping SSSP: From Vertices and Edges to GraphBLAS Implementations","authors":"Upasana Sridhar, Mark P. Blanco, Rahul Mayuranath, Daniele G. Spampinato, Tze Meng Low, Scott McMillan","doi":"10.1109/IPDPSW.2019.00047","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00047","url":null,"abstract":"GraphBLAS is an interface for implementing graph algorithms. Algorithms implemented using the GraphBLAS interface are cast in terms of linear algebra-like operations. However, many graph algorithms are canonically described in terms of operations on vertices and/or edges. Despite the known duality between these two representations, the differences in the way algorithms are described using the two approaches can pose considerable difficulties in the adoption of GraphBLAS as a standard interface for development. This paper investigates a systematic approach for translating a graph algorithm described in the canonical vertex and edge representation into an implementation that leverages the GraphBLAS interface. We present a two-step approach to this problem. First, we express common vertex- and edge-centric design patterns using a linear algebraic language. Second, we map this intermediate representation to the GraphBLAS interface. We illustrate our approach by translating the delta-stepping single source shortest path algorithm from its canonical description to a GraphBLAS implementation, and highlight lessons learned from implementing with GraphBLAS.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115585449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Programmable Acceleration for Sparse Matrices in a Data-Movement Limited World","authors":"Arjun Rawal, Yuanwei Fang, A. Chien","doi":"10.1109/IPDPSW.2019.00016","DOIUrl":"https://doi.org/10.1109/IPDPSW.2019.00016","url":null,"abstract":"Data movement cost is a critical performance concern in today's computing systems. We propose a heterogeneous architecture that combines a CPU core with an efficient data recoding accelerator and evaluate it on sparse matrix computation. Such computations underly a wide range of important computations such as partial differential equation solvers, sequence alignment, and machine learning and are often data movement limited. The data recoding accelerator is orders of magnitude more energy efficient than a conventional CPU for recoding, allowing sparse matrix representation to be optimized for data movement. We evaluate the heterogeneous system with a recoding accelerator using the TAMU sparse matrix library, studying >369 diverse sparse matrix examples finding geometric mean performance benefits of 2.4x. In contrast, CPU's exhibit poor recoding performance (up to 30x worse), making data representation optimization infeasible. Holding SpMV performance constant, adding the recoding optimization and accelerator can produce power reductions of 63% and 51% on DDR and HBM-based memory systems, respectively, when evaluated on a set of 7 representative matrices. 
These results show the promise of this new heterogeneous architecture approach.","PeriodicalId":292054,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124373705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}