2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum: Latest Publications

Message from the HCW Steering Committee Chair

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2019-05-01 DOI: 10.1109/IPDPSW.2016.220

B. Shirazi

Citations: 0

Unstructured Control Flow in GPGPU

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.247

Rodrigo Dominguez, D. Kaeli

Abstract: The current trend toward heterogeneous architectures motivates us to reconsider current software and hardware paradigms. The focus is centered around new parallel programming models, compiler design, and runtime resource management techniques to exploit the features of many-core processor architectures. Graphics Processing Units (GPUs) have become the platform of choice in this area for accelerating a large range of data-parallel and task-parallel applications. The rapid adoption of GPU computing has been greatly aided by the introduction of high-level programming environments such as CUDA C and OpenCL. However, each vendor implements these programming models differently and we must analyze the internals in order to get a better understanding of the performance results. One of the main differences across implementations is the handling of program control flow by the compiler and the hardware. Some implementations can support unstructured control flow based on branches and labels; others are based on structured control flow relying solely on if-then and while constructs. In this paper we describe a tool that can be used to analyze the difference between these two approaches. We created a dynamic compiler called Caracal that translates applications with unstructured control flow so they can run on hardware that requires structured programs. In order to accomplish this, Caracal builds a control tree of the program and creates single-entry, single-exit regions called hammock graphs. We used this tool to analyze the performance differences between NVIDIA's implementation of CUDA C and AMD's implementation of OpenCL. We found that the requirement for structured control flow can increase the number of registers allocated by 20 registers and impact performance as much as 2x.
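The abstract above mentions single-entry, single-exit regions (hammock graphs). The following is a minimal Python sketch of how such a region could be recognized in a control-flow graph. It is not Caracal's algorithm: the function name `region_entry_exit`, the dictionary-based graph encoding, and both example graphs are assumptions made purely for illustration, and the check is a simplified reading of "single-entry, single-exit".

```python
def region_entry_exit(cfg, region):
    """Return (entry, exit_target) if `region` is single-entry, single-exit, else None.

    cfg maps each basic-block name to the list of its successor blocks
    (a hypothetical encoding chosen for this sketch).
    """
    region = set(region)
    # Region blocks that are reached by an edge coming from outside the region.
    entries = {dst for src, dsts in cfg.items() if src not in region
               for dst in dsts if dst in region}
    # Outside blocks that are reached by an edge leaving the region.
    exits = {dst for src in region for dst in cfg.get(src, ())
             if dst not in region}
    if len(entries) == 1 and len(exits) == 1:
        return entries.pop(), exits.pop()
    return None


# Structured if-then-else: A branches to B or C, both rejoin at D.
diamond = {"start": ["A"], "A": ["B", "C"], "B": ["D"], "C": ["D"],
           "D": ["end"], "end": []}

# Short-circuit `if (a || b) X else Y` expressed with branches and labels:
# the test of `b` can jump straight into X, so {B_test, Y} has two exit targets.
short_circuit = {"start": ["A_test"], "A_test": ["X", "B_test"],
                 "B_test": ["X", "Y"], "X": ["join"], "Y": ["join"],
                 "join": []}

print(region_entry_exit(diamond, {"A", "B", "C", "D"}))
print(region_entry_exit(short_circuit, {"B_test", "Y"}))
print(region_entry_exit(short_circuit, {"A_test", "B_test", "X", "Y"}))
```

In this sketch the inner region of the short-circuit example fails the check, while the enclosing region passes. That is the kind of situation in which a translator targeting structured hardware would need to restructure the inner branches.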

Citations: 5

Identifying High Betweenness Centrality Vertices in Large Noisy Networks

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.171

Vladimir Ufimtsev, S. Bhowmick

Citations: 10

Avoiding Locks and Atomic Instructions in Shared-Memory Parallel BFS Using Optimistic Parallelization

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.241

Jesmin Jahan Tithi, Dhruv Mátáni, Gaurav Menghani, R. Chowdhury

Abstract: Dynamic load-balancing in parallel algorithms typically requires locks and/or atomic instructions for correctness. We have shown that sometimes an optimistic parallelization approach can be used to avoid the use of locks and atomic instructions during dynamic load balancing. In this approach one allows potentially conflicting operations to run in parallel with the hope that everything will run without conflicts, and if any occasional inconsistencies arise due to conflicts, one will be able to handle them without hampering the overall correctness of the program. We have used this approach to implement two new types of high-performance lock-free parallel BFS algorithms and their variants based on centralized job queues and distributed randomized work-stealing, respectively. These algorithms are implemented using Intel Cilk++, and shown to be scalable and faster than two state-of-the-art multicore parallel BFS algorithms by Leiserson and Schardl (SPAA, 2010) and Hong et al. (PACT, 2011), where the algorithm described in the first paper is also free of locks and atomic instructions but does not use optimistic parallelization. Our implementations also efficiently handle scale-free graphs, which frequently arise in real-world scenarios such as the World Wide Web, social networks, and biological interaction networks.
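The optimistic idea in the abstract above (let conflicting operations run, then tolerate the occasional inconsistency) can be illustrated with a small level-synchronous BFS in Python. This is a hedged sketch, not the paper's centralized-queue or work-stealing algorithms: the function `optimistic_bfs`, the static slicing of the frontier, and the deduplication step are all assumptions chosen for illustration. CPython threads also do not run this loop truly in parallel, so the sketch shows the correctness argument rather than the speedup.

```python
import threading


def optimistic_bfs(adj, source, num_workers=4):
    """Level-synchronous BFS with no locks around the distance array.

    adj is a list of adjacency lists; returns BFS distances (-1 if unreachable).
    """
    dist = [-1] * len(adj)
    dist[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        # Each worker writes only to its own output list.
        next_parts = [[] for _ in range(num_workers)]

        def work(wid):
            out = next_parts[wid]
            for u in frontier[wid::num_workers]:
                for v in adj[u]:
                    if dist[v] == -1:        # unsynchronized check
                        dist[v] = level + 1  # racing writers store the same value
                        out.append(v)        # v may be recorded by several workers

        threads = [threading.Thread(target=work, args=(w,))
                   for w in range(num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()                         # joining acts as the per-level barrier
        # Tolerate the conflict after the fact: drop duplicate discoveries.
        frontier = list(set().union(*next_parts))
        level += 1
    return dist


graph = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3], []]
print(optimistic_bfs(graph, 0))
```

The race is benign in this sketch because every worker that discovers a vertex during a level writes the same distance, so the only inconsistency is a duplicate frontier entry, which is removed before the next level starts.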

Citations: 9

Semi-Matching Algorithms for Scheduling Parallel Tasks under Resource Constraints

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.30

Anne Benoit, Johannes Langguth, B. Uçar

Citations: 3

ASHES Introduction

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.290

Jiayuan Meng

Citations: 0

Application of Evolutionary Algorithms to Maximum Lifetime Coverage Problem in Wireless Sensor Networks

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.96

A. Tretyakova, F. Seredyński

Citations: 13

Scalable, Multithreaded, Partially-in-Place Sorting

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.74

D. Haglin, Robert Adolf, Greg E. Mackey

Citations: 1

Efficient Hough Transform on the FPGA using DSP Slices and Block RAMs

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.86

Xin Zhou, Norihiro Tomagou, Yasuaki Ito, K. Nakano

Abstract: The main contribution of this paper is to present a new FPGA architecture for the Hough transform that identifies straight lines in a binary image. Recent FPGAs have hundreds of embedded DSP slices and block RAMs. For example, Xilinx Virtex-6 Family FPGAs have a DSP48E1 slice, which is a configurable logic block equipped with fast multipliers, adders, pipeline registers, and so on. They also have a dual-port memory with 18 Kbits as a block RAM. One of the key techniques for accelerating computation using FPGAs is the efficient use of DSP slices and block RAMs. Our new architecture for the Hough transform uses 178 DSP48E1 slices and 180 block RAMs with 18 Kbits that work in parallel. As far as we know, there is no previously published work that fully utilizes DSP slices and block RAMs for the Hough transform. Roughly speaking, a conventional sequential implementation performs 180m voting operations for m edge points. Our architecture performs voting operations in parallel, and outputs identified straight lines in m+97 clock cycles. Since 180m voting operations are performed using 178 DSP48E1 slices, the lower bound of the computing time is m clock cycles. Hence our implementation is close to optimal. The implementation results show that the Hough transform for a 512×512 image with 33232 edge points can be done in only 135.75 µs.
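The "180m voting operations for m edge points" in the abstract above corresponds to the conventional sequential Hough transform, sketched below in Python. This is a textbook-style illustration under stated assumptions, not the paper's FPGA architecture: the function name `hough_votes`, the one-degree angle step, and the rounding of rho to the nearest integer are choices made for this example.

```python
import math
from collections import Counter


def hough_votes(edge_points, num_angles=180):
    """Accumulate Hough votes: one vote per (edge point, angle) pair.

    Returns a Counter keyed by (angle_index, rho), where
    theta = pi * angle_index / num_angles and rho = x*cos(theta) + y*sin(theta).
    """
    acc = Counter()
    for x, y in edge_points:
        for k in range(num_angles):
            theta = math.pi * k / num_angles
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            acc[(k, rho)] += 1  # one voting operation
    return acc


# Three edge points on the vertical line x = 5.
points = [(5, 0), (5, 1), (5, 2)]
acc = hough_votes(points)
print(sum(acc.values()))  # total votes cast: 180 per edge point
print(acc[(0, 5)])        # votes for the line theta = 0, rho = 5
```

Each edge point casts exactly one vote per angle, so the sequential cost is 180 votes per point. Accumulator cells that collect many votes correspond to candidate straight lines.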

Citations: 12

Revisiting the Double Checkpointing Algorithm

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.11

J. Dongarra, T. Hérault, Y. Robert

Citations: 14