{"title":"Scheduling Irregular Dataflow Pipelines on SIMD Architectures","authors":"Tom Plano, J. Buhler","doi":"10.1145/3380479.3380480","DOIUrl":"https://doi.org/10.1145/3380479.3380480","url":null,"abstract":"Streaming computations often exhibit substantial data parallelism that makes them well-suited to SIMD architectures. However, many such computations also exhibit irregularity, in the form of data-dependent, dynamic data rates, that makes efficient SIMD execution challenging. One aspect of this challenge is the need to schedule execution of a computation realized as a pipeline of stages connected by finite queues. A scheduler must both ensure high SIMD occupancy by gathering queued items into vectors and minimize costs associated with switching execution between stages. In this work, we present the AFIE (Active Full, Inactive Empty) scheduling policy for irregular streaming applications on SIMD processors. AFIE provably groups inputs to each stage of a pipeline into a minimal number of SIMD vectors while incurring a bounded number of switches relative to the best possible policy. These results apply even though irregularity forbids a priori knowledge of how many outputs will be generated from each input to each stage. We have implemented AFIE as an extension to the MERCATOR system [6] for building irregular streaming applications on NVIDIA GPUs. We describe how the AFIE scheduler simplifies MERCATOR's runtime code and empirically measure the new scheduler's improved performance on irregular streaming applications.","PeriodicalId":164160,"journal":{"name":"Proceedings of the 2020 Sixth Workshop on Programming Models for SIMD/Vector Processing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125108499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SIMD-based Exact Parallel Fuzzy Dilation Operator for Fast Computing of Fuzzy Spatial Relations","authors":"Régis Pierrard, Laurent Cabaret, Jean-Philippe Poli, C. Hudelot","doi":"10.1145/3380479.3380482","DOIUrl":"https://doi.org/10.1145/3380479.3380482","url":null,"abstract":"For decades, fuzzy spatial relations have demonstrated their utility and effectiveness for visual reasoning, including semantic annotation and object recognition. However, a major issue is that they often involve fuzzy morphological operators that are compute-intensive leading to long latency in the relation evaluation. As a result, approximate methods have been proposed to compute some relations in an acceptable time, but they are not as generic as the fuzzy dilation or do not make the most of modern computing architectures. In this paper, we introduce the Reverse and the Parallel Reverse (PR) algorithms. Reverse is an exact and efficient algorithm for the fuzzy dilation operator and PR combines the Reverse algorithm exactness with efficient usage of modern-processor multiple cores using OpenMP. Using SIMD extensions to enhance Parallel Reverse, PR128 (AVX), PR256 (AVX2), and PR512 (AVX512) are faster than the state-of-the-art approximate methods while remaining generic and exact. To demonstrate the performance of PR and highlight the contribution of the SIMD instructions, an extensive benchmark was carried out on two datasets of natural and artificial images.","PeriodicalId":164160,"journal":{"name":"Proceedings of the 2020 Sixth Workshop on Programming Models for SIMD/Vector Processing","volume":"326 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124297867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How to speed Connected Component Labeling up with SIMD RLE algorithms","authors":"F. Lemaitre, A. Hennequin, L. Lacassagne","doi":"10.1145/3380479.3380481","DOIUrl":"https://doi.org/10.1145/3380479.3380481","url":null,"abstract":"Research in Connected Component Labeling (CCL), although long-standing, is still very active: several efficient algorithms for CPUs and GPUs have emerged in recent years and continue to improve performance. This article introduces a new SIMD run-based algorithm for CCL. We show how RLE compression can be SIMDized and used to accelerate scalar run-based CCL algorithms. A benchmark on Intel, AMD and ARM processors shows that this new algorithm outperforms the state of the art by an average factor of 1.7x on AVX2 machines and 1.9x on Intel Xeon Skylake with AVX512.","PeriodicalId":164160,"journal":{"name":"Proceedings of the 2020 Sixth Workshop on Programming Models for SIMD/Vector Processing","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128579880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}