Siddharth Gupta, Diana Palsetia, Md. Mostofa Ali Patwary, Ankit Agrawal, A. Choudhary
{"title":"A New Parallel Algorithm for Two-Pass Connected Component Labeling","authors":"Siddharth Gupta, Diana Palsetia, Md. Mostofa Ali Patwary, Ankit Agrawal, A. Choudhary","doi":"10.1109/IPDPSW.2014.152","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.152","url":null,"abstract":"Connected Component Labeling (CCL) is one of the most important steps in pattern recognition and image processing. It assigns labels to the pixels such that adjacent pixels sharing the same features are assigned the same label. Typically, CCL requires several passes over the data. We focus on the two-pass technique, where each pixel is given a provisional label in the first pass and its final label in the second pass. We present a scalable parallel two-pass CCL algorithm, called PAREMSP, which employs a scan strategy and the best union-find technique called REMSP, which uses Rem's algorithm for storing label equivalence information of pixels in a 2-D image. In the first pass, we divide the image among threads and each thread runs the scan phase along with REMSP simultaneously. In the second phase, we assign the final labels to the pixels. As REMSP is easily parallelizable, we use the parallel version of REMSP for merging the pixels on the boundary. Our experiments show the scalability of PAREMSP achieving speedups up to 20.1 using 24 cores on shared memory architecture using OpenMP for an image of size 465.20 MB. We find that our proposed parallel algorithm achieves linear scaling for a large resolution fixed problem size as the number of processing elements is increased. Additionally, the parallel algorithm does not make use of any hardware specific routines, and thus is highly portable.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127533950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Becker, R. Vaidyanathan, M. Santambrogio, J. Tørresen, R. Sass, P. Leong
{"title":"RAW Introduction and Committees","authors":"J. Becker, R. Vaidyanathan, M. Santambrogio, J. Tørresen, R. Sass, P. Leong","doi":"10.1109/IPDPSW.2014.208","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.208","url":null,"abstract":"This book presents the proceedings of the 21st Reconfigurable Architectures Workshop (RAW 2014) held in Phoenix, USA, on May 19-20, 2014. RAW 2014 is associated with the 28th Annual International Parallel & Distributed Processing Symposium (IPDPS 2014) and is sponsored by the IEEE Computer Society's Technical Committee on Parallel Processing. The workshop is one of the major meetings for researchers to present ideas, results, and ongoing research on both theoretical and practical advances in Reconfigurable Computing.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-05-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127459903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PLC Introduction and Committees","authors":"B. Chapman","doi":"10.1109/IPDPSW.2014.218","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.218","url":null,"abstract":"Workshop theme: three essential pillars of successful parallel computing: productivity, portability, and performance. Creating software for heterogeneous systems can be quite complex, especially when the low-level details need to be managed and abstracted from the programmer. Emerging standards provide an incremental development path for targeting heterogeneous architectures, be it NVIDIA, ARM, Intel or AMD. We all know software is an expensive investment. Portability is necessary, ensuring a long lifetime of the software and thus reducing the maintenance cost. Other challenges include locality and memory issues, load balancing, hiding latency with concurrency and so on. This workshop aims to brainstorm ways to make programming heterogeneous systems less challenging and more interesting. We believe that this workshop will provide a forum for the presentation and discussion of research on all aspects of heterogeneous systems: programming models, compiler optimizations, language extensions, and software tools for such systems. Areas of interest include but are not limited to the following topics:","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116985792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Radiation Tolerance of Color Configuration on an Optically Reconfigurable Gate Array","authors":"Takumi Fujimori, Minoru Watanabe","doi":"10.1109/IPDPSW.2014.27","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.27","url":null,"abstract":"Currently, research on optically reconfigurable gate arrays (ORGAs), a type of multi-context field programmable gate array (FPGA), has progressed rapidly. ORGAs offer important benefits of high-speed reconfiguration, numerous reconfiguration contexts, and robust configuration. Such ORGAs always consist of a single-wavelength laser array to address configuration contexts. However, for this architecture, concerns related to its package size often arise. The laser array is large because of the large space between lasers. For that reason, the ORGA also tends to be large. Therefore, we have introduced multiple-wavelength lasers inside a laser array of an ORGA to decrease the laser array size. Results show that the ORGA package can be smaller. However, the dependability of color configuration has never been discussed up to now. This paper therefore presents a demonstration of the radiation tolerance of color configuration on an optically reconfigurable gate array.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"52 6","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120861365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Booth Algorithm for Three-Integers Multiplication for Reconfigurable Mesh","authors":"Y. Ben-Asher, Esti Stein","doi":"10.1142/S0219265915500097","DOIUrl":"https://doi.org/10.1142/S0219265915500097","url":null,"abstract":"This paper presents a three-integers multiplication algorithm R = A * X * Y for Reconfigurable Mesh (RM). It is based on a three-integer multiplication algorithm for faster FPGA implementations. We show that multiplying three integers of n bits can be performed on a 3D RM of size (3n+log n + 1)×(2√n+1+3) × √n+1 using 44 + 18·log log MNO steps, where MNO is a bound which is related to the number of sequences of '1's in the multiplied numbers. The value of MNO is bounded by n, but experimentally we show that on average it is √n. Two algorithms for solving multiplication on an RM exist and their techniques are asymptotically better in time, O(1) and O(log*n), but they suffer from large hidden constants and slow data insertion time O(√n), respectively. The proposed algorithm is relatively simple and faster on average (via sampling input values) than the previous two algorithms, and thus contributes to making the RM a practical and feasible model. Our experiments show a significant improvement in the expected number of elementary operations for the proposed algorithm.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125444293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Buffer Sizes for Pipeline Workflow Scheduling with Setup Times","authors":"A. Benoit, J. Nicod, V. Rehn-Sonigo","doi":"10.1109/IPDPSW.2014.77","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.77","url":null,"abstract":"Mapping linear workflow applications onto a set of homogeneous processors can be optimally solved in polynomial time for the throughput objective with fewer processors than stages. This result even holds true when setup times occur in the execution and homogeneous buffers are available for the storage of intermediate results. In this kind of application, several computation stages are interconnected as a linear application graph, each stage holds a buffer of limited size where intermediate results are stored, and a processor setup time occurs when passing from one stage to another. In this paper, we tackle the problem where the buffer sizes are not given beforehand and have to be fixed before the execution to maximize the throughput within each processor. The goal of this work is to minimize the cost induced by the setup times by allocating buffers whose sizes are proportional to each other. We present a closed formula to compute the optimal buffer allocation in the case of non-decreasing setup costs in the linear application. For the case of unsorted setup times, we provide competitive heuristics that are validated via extensive simulation.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126731528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs","authors":"Simplice Donfack, S. Tomov, J. Dongarra","doi":"10.1109/IPDPSW.2014.109","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.109","url":null,"abstract":"Graphics processing units (GPUs) brought huge performance improvements in the scientific and numerical fields. We present an efficient hybrid CPU/GPU approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids data transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before the execution, and then dynamically balances workloads during the execution. Then, we present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt on any architecture using the manufacturer's characteristics of the underlying machine. We illustrate our method for the LU factorization. For this case, we show that the use of our approach combined with a communication avoiding LU algorithm is efficient. For example, our experiments on a 24-core AMD Opteron 6172 show that by adding one GPU (Tesla S2050) we accelerate LU up to 2.4× compared to the corresponding routine in MKL using 24 cores. The comparisons with MAGMA also show significant improvements.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116134855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nanoscale Cluster Detection in Massive Atom Probe Tomography Data","authors":"S. Seal, Srikanth B. Yoginath, Michael K. Miller","doi":"10.1109/IPDPSW.2014.133","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.133","url":null,"abstract":"Recent technological advances in atom probe tomography (APT) have led to unprecedented data acquisition capabilities that routinely generate data sets containing hundreds of millions of atoms. Detecting nanoscale clusters of different atom types present in these enormous amounts of data and analyzing their spatial correlations with one another are fundamental to understanding the structural properties of the material from which the data is derived. Extant algorithms for nanoscale cluster detection do not scale to large data sets. Here, a scalable, CUDA-based implementation of an autocorrelation algorithm is presented. It isolates spatial correlations amongst atomic clusters present in massive APT data sets in linear time using a linear amount of storage. Correctness of the algorithm is demonstrated using large synthetically generated data with known spatial distributions. Benefits and limitations of using GPU-acceleration for autocorrelation-based APT data analyses are presented with supporting performance results on data sets with up to billions of atoms. To our knowledge, this is the first nanoscale cluster detection algorithm that scales to massive APT data sets and executes on commodity hardware.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122705916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simon Pickartz, Pablo Reble, Carsten Clauss, Stefan Lankes
{"title":"SWIFT: A Transparent and Flexible Communication Layer for PCIe-Coupled Accelerators and (Co-)Processors","authors":"Simon Pickartz, Pablo Reble, Carsten Clauss, Stefan Lankes","doi":"10.1109/IPDPSW.2014.48","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.48","url":null,"abstract":"The Peripheral Component Interconnect Express (PCIe) is the predominant interconnect enabling the CPU to communicate with attached input/output and storage devices. Considering its high performance and capabilities to connect different address domains via the so-called Non-Transparent Bridging (NTB) technology, it starts to be an alternative or addition to traditional interconnects. The PCIe technology enables devices to communicate in a peer-to-peer manner allowing for new implementation possibilities of tomorrow's high-performance systems. Components being attached to the same computer rack are connected by means of PCIe and the racks themselves by using traditional network technologies. This leads to a heterogeneous landscape of compute nodes and high-performance interconnects. The Socket Wheeled Intelligent Fabric Transport (SWIFT) takes up the challenge of programming these systems. The presented implementation is highly portable due to a hardware abstraction layer allowing for bringing the implemented concepts to new interconnects with minimal effort. It is evaluated on a test system exposing different compute nodes equipped with coprocessors, which take part in a PCIe non-transparent bridging architecture. Besides low-level benchmarks investigating principal performance characteristics of the communication layer, MPI benchmark results are presented illustrating how scientific applications may be ported to heterogeneous environments.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129153151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}