2014 IEEE International Parallel & Distributed Processing Symposium Workshops: Latest Papers

A New Parallel Algorithm for Two-Pass Connected Component Labeling

2014 IEEE International Parallel & Distributed Processing Symposium Workshops Pub Date : 2016-06-20 DOI: 10.1109/IPDPSW.2014.152

Siddharth Gupta, Diana Palsetia, Md. Mostofa Ali Patwary, Ankit Agrawal, A. Choudhary

Abstract: Connected Component Labeling (CCL) is one of the most important steps in pattern recognition and image processing. It assigns labels to pixels such that adjacent pixels sharing the same features are assigned the same label. Typically, CCL requires several passes over the data. We focus on the two-pass technique, where each pixel is given a provisional label in the first pass and its actual label in the second pass. We present a scalable parallel two-pass CCL algorithm, called PAREMSP, which employs a scan strategy and the best union-find technique, called REMSP, which uses Rem's algorithm for storing label equivalence information of pixels in a 2-D image. In the first pass, we divide the image among threads and each thread runs the scan phase along with REMSP simultaneously. In the second phase, we assign the final labels to the pixels. As REMSP is easily parallelizable, we use the parallel version of REMSP for merging the pixels on the boundary. Our experiments show the scalability of PAREMSP, achieving speedups of up to 20.1 using 24 cores on a shared-memory architecture using OpenMP for an image of size 465.20 MB. We find that our proposed parallel algorithm achieves linear scaling for a large-resolution fixed problem size as the number of processing elements is increased. Additionally, the parallel algorithm does not make use of any hardware-specific routines, and thus is highly portable.
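The two-pass idea described in the abstract can be illustrated with a small sequential sketch. This is not the authors' PAREMSP code: it is single-threaded, it assumes 4-connectivity on a binary image (the abstract does not state the connectivity or scan mask), and the names `merge` and `label_image` are hypothetical. The `merge` routine follows the commonly described form of Rem's union-find with splicing, which keeps each parent index no larger than its child so that labels can be flattened in one sweep.

```python
def merge(parent, x, y):
    """Union the sets containing x and y (Rem's algorithm with splicing).

    Invariant: parent[i] <= i, so every root is the smallest index on its path.
    """
    while parent[x] != parent[y]:
        if parent[x] > parent[y]:
            if x == parent[x]:          # x is a root: hang it under y's parent
                parent[x] = parent[y]
                break
            nxt = parent[x]             # splice: redirect x, then climb
            parent[x] = parent[y]
            x = nxt
        else:
            if y == parent[y]:          # y is a root: hang it under x's parent
                parent[y] = parent[x]
                break
            nxt = parent[y]
            parent[y] = parent[x]
            y = nxt
    return parent[x]


def label_image(image):
    """Two-pass labeling of a binary image (list of rows of 0/1 values).

    Returns (labels, count); background pixels keep label 0.
    Assumes 4-connectivity (upper and left neighbours only).
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]                        # index 0 is reserved for background

    # Pass 1: provisional labels plus label equivalences.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if up and left:
                labels[r][c] = merge(parent, up, left)
            elif up or left:
                labels[r][c] = up or left
            else:
                parent.append(len(parent))      # new provisional label
                labels[r][c] = len(parent) - 1

    # Flatten: give each root a consecutive final label, then propagate it.
    count = 0
    for i in range(1, len(parent)):
        if parent[i] < i:
            parent[i] = parent[parent[i]]
        else:
            count += 1
            parent[i] = count

    # Pass 2: replace provisional labels with final labels.
    for r in range(rows):
        for c in range(cols):
            labels[r][c] = parent[labels[r][c]]
    return labels, count
```

For a U-shaped image such as `[[1, 0, 1], [1, 0, 1], [1, 1, 1]]`, the two arms receive different provisional labels in the first pass and are merged when the scan reaches the bottom-right pixel, so `label_image` reports a single component. A parallel variant along the lines of the abstract would run the first pass on separate image strips and then merge labels across strip boundaries; that step is not shown here.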

Citations: 28

HPDIC Introduction and Committees

2014 IEEE International Parallel & Distributed Processing Symposium Workshops Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2014.230

C. Cérin, Congfeng Jiang

Citations: 0

RAW Introduction and Committees

2014 IEEE International Parallel & Distributed Processing Symposium Workshops Pub Date : 2015-05-25 DOI: 10.1109/IPDPSW.2014.208

J. Becker, R. Vaidyanathan, M. Santambrogio, J. Tørresen, R. Sass, P. Leong

Citations: 0

PLC Introduction and Committees

2014 IEEE International Parallel & Distributed Processing Symposium Workshops Pub Date : 2014-05-19 DOI: 10.1109/IPDPSW.2014.218

B. Chapman

Citations: 0

Radiation Tolerance of Color Configuration on an Optically Reconfigurable Gate Array

2014 IEEE International Parallel & Distributed Processing Symposium Workshops Pub Date : 2014-05-19 DOI: 10.1109/IPDPSW.2014.27

Takumi Fujimori, Minoru Watanabe

Citations: 1

Adaptive Booth Algorithm for Three-Integers Multiplication for Reconfigurable Mesh

2014 IEEE International Parallel & Distributed Processing Symposium Workshops Pub Date : 2014-05-19 DOI: 10.1142/S0219265915500097

Y. Ben-Asher, Esti Stein

Citations: 1

Optimizing Buffer Sizes for Pipeline Workflow Scheduling with Setup Times

2014 IEEE International Parallel & Distributed Processing Symposium Workshops Pub Date : 2014-05-19 DOI: 10.1109/IPDPSW.2014.77

A. Benoit, J. Nicod, V. Rehn-Sonigo

Abstract: Mapping linear workflow applications onto a set of homogeneous processors can be optimally solved in polynomial time for the throughput objective with fewer processors than stages. This result holds even when setup times occur in the execution and homogeneous buffers are available for the storage of intermediate results. In this kind of application, several computation stages are interconnected as a linear application graph, each stage holds a buffer of limited size where intermediate results are stored, and a processor setup time occurs when passing from one stage to another. In this paper, we tackle the problem where the buffer sizes are not given beforehand and have to be fixed before the execution to maximize the throughput within each processor. The goal of this work is to minimize the cost induced by the setup times by allocating buffers whose sizes are proportional to one another. We present a closed formula to compute the optimal buffer allocation in the case of non-decreasing setup costs in the linear application. For the case of unsorted setup times, we provide competitive heuristics that are validated via extensive simulation.

Citations: 6

Dynamically Balanced Synchronization-Avoiding LU Factorization with Multicore and GPUs

2014 IEEE International Parallel & Distributed Processing Symposium Workshops Pub Date : 2014-05-19 DOI: 10.1109/IPDPSW.2014.109

Simplice Donfack, S. Tomov, J. Dongarra

Abstract: Graphics processing units (GPUs) brought huge performance improvements in the scientific and numerical fields. We present an efficient hybrid CPU/GPU approach that is portable, dynamically and efficiently balances the workload between the CPUs and the GPUs, and avoids data transfer bottlenecks that are frequently present in numerical algorithms. Our approach determines the amount of initial work to assign to the CPUs before the execution, and then dynamically balances workloads during the execution. We then present a theoretical model to guide the choice of the initial amount of work for the CPUs. The validation of our model allows our approach to self-adapt on any architecture using the manufacturer's characteristics of the underlying machine. We illustrate our method for the LU factorization. For this case, we show that the use of our approach combined with a communication-avoiding LU algorithm is efficient. For example, our experiments on a 24-core AMD Opteron 6172 show that by adding one GPU (Tesla S2050) we accelerate LU by up to 2.4× compared to the corresponding routine in MKL using 24 cores. The comparisons with MAGMA also show significant improvements.

Citations: 14

Nanoscale Cluster Detection in Massive Atom Probe Tomography Data

2014 IEEE International Parallel & Distributed Processing Symposium Workshops Pub Date : 2014-05-19 DOI: 10.1109/IPDPSW.2014.133

S. Seal, Srikanth B. Yoginath, Michael K. Miller

Abstract: Recent technological advances in atom probe tomography (APT) have led to unprecedented data acquisition capabilities that routinely generate data sets containing hundreds of millions of atoms. Detecting nanoscale clusters of different atom types present in these enormous amounts of data and analyzing their spatial correlations with one another are fundamental to understanding the structural properties of the material from which the data is derived. Extant algorithms for nanoscale cluster detection do not scale to large data sets. Here, a scalable, CUDA-based implementation of an autocorrelation algorithm is presented. It isolates spatial correlations amongst atomic clusters present in massive APT data sets in linear time using a linear amount of storage. Correctness of the algorithm is demonstrated using large synthetically generated data with known spatial distributions. Benefits and limitations of using GPU acceleration for autocorrelation-based APT data analyses are presented with supporting performance results on data sets with up to billions of atoms. To our knowledge, this is the first nanoscale cluster detection algorithm that scales to massive APT data sets and executes on commodity hardware.

Citations: 3

SWIFT: A Transparent and Flexible Communication Layer for PCIe-Coupled Accelerators and (Co-)Processors

2014 IEEE International Parallel & Distributed Processing Symposium Workshops Pub Date : 2014-05-19 DOI: 10.1109/IPDPSW.2014.48

Simon Pickartz, Pablo Reble, Carsten Clauss, Stefan Lankes

Abstract: The Peripheral Component Interconnect Express (PCIe) is the predominant interconnect enabling the CPU to communicate with attached input/output and storage devices. Considering its high performance and its capability to connect different address domains via the so-called Non-Transparent Bridging (NTB) technology, it is becoming an alternative or addition to traditional interconnects. The PCIe technology enables devices to communicate in a peer-to-peer manner, allowing for new implementation possibilities for tomorrow's high-performance systems. Components attached to the same computer rack are connected by means of PCIe, and the racks themselves by traditional network technologies. This leads to a heterogeneous landscape of compute nodes and high-performance interconnects. The Socket Wheeled Intelligent Fabric Transport (SWIFT) takes up the challenge of programming these systems. The presented implementation is highly portable due to a hardware abstraction layer that allows the implemented concepts to be brought to new interconnects with minimal effort. It is evaluated on a test system exposing different compute nodes equipped with coprocessors, which take part in a PCIe non-transparent bridging architecture. Besides low-level benchmarks investigating principal performance characteristics of the communication layer, MPI benchmark results are presented illustrating how scientific applications may be ported to heterogeneous environments.

Citations: 2