{"title":"Message from the WAMCA 2020 General Chair","authors":"","doi":"10.1109/sbac-pad49847.2020.00053","DOIUrl":"https://doi.org/10.1109/sbac-pad49847.2020.00053","url":null,"abstract":"WAMCA was created as an associated workshop of SBAC-PAD in 2009. The aim was to provide a specific channel for contributions and discussions on multi-core applications. Then, we have striven to implement it every year in conjunction with the corresponding SBAC-PAD. The initial topic has been extended to cover all topics related to shared memory parallelism and accelerators. This adaptation was necessary because most of accelerators that have emerged so far follow a shared memory model, even if the original data typically come from a remote main memory. This year, we received 23 submissions and accepted 10, with an average of 3 reviews per paper, thus an acceptance rate of 43%. We thank the authors of all submitted papers for their consideration and we expect to remain attractive for an increasingly larger community.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127751062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing the Loop Scheduling Mechanisms on Julia Multithreading","authors":"Diana A. Barros, C. Bentes","doi":"10.1109/SBAC-PAD49847.2020.00043","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00043","url":null,"abstract":"Julia is a quite recent dynamic language proposed to tackle the trade-off between productivity and efficiency. The idea is to provide the usability o flanguages such as Python or MATLAB side by sidewith the performance of C and C++. The support for multithreading programming in Julia was only released last year, and therefore still requires performance studies. In this work, we focus on the parallel loops and more specifically on the available mechanisms for assigning the loop iterations to the threads. We analyse the per-formance of the macros @spawn and @threads, used for loop parallelization. Our results show that there is no best fit solution for all cases. The use of @spawn provides better load balance for unbalanced loops with reasonably heavy iterations, but incurs in high overhead for workstealing. While @threads has low overhead, and workswell for loops with good balance among iterations.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114254242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Performance Low-Memory Lowering: GEMM-based Algorithms for DNN Convolution","authors":"Andrew Anderson, Aravind Vasudevan, Cormac Keane, David Gregg","doi":"10.1109/SBAC-PAD49847.2020.00024","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00024","url":null,"abstract":"Deep Neural Network Convolution is often implemented with general matrix multiplication ( GEMM ) using the well-known im2col algorithm. This algorithm constructs a Toeplitz matrix from the input feature maps, and multiplies them by the convolutional kernel. With input feature map dimensions C × H × W and kernel dimensions M × C × K^2, im2col requires O(K^2CHW ) additional space. Although this approach is very popular, there has been little study of the associated design space. We show that the im2col algorithm is just one point in a regular design space of algorithms which translate convolution to GEMM. We enumerate this design space, and experimentally evaluate each algorithmic variant. Our evaluation yields several novel low-memory algorithms which match the performance of the best known approaches despite requiring only a small fraction of the additional memory.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114728487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"XPySom: High-Performance Self-Organizing Maps","authors":"Riccardo Mancini, Antonio Ritacco, Giacomo Lanciano, T. Cucinotta","doi":"10.1109/SBAC-PAD49847.2020.00037","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00037","url":null,"abstract":"In this paper, we introduce XPySom, a new open-source Python implementation of the well-known Self-Organizing Maps (SOM) technique. It is designed to achieve high performance on a single node, exploiting widely available Python libraries for vector processing on multi-core CPUs and GP-GPUs. We present results from an extensive experimental evaluation of XPySom in comparison to widely used open-source SOM implementations, showing that it outperforms the other available alternatives. Indeed, our experimentation carried out using the Extended MNIST open data set shows a speed-up of about 7x and 100x when compared to the best open-source multi-core implementations we could find with multi-core and GP-GPU acceleration, respectively, achieving the same accuracy levels in terms of quantization error.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127671056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Selective Protection for Sparse Iterative Solvers to Reduce the Resilience Overhead","authors":"Hongyang Sun, Ana Gainaru, Manu Shantharam, P. Raghavan","doi":"10.1109/SBAC-PAD49847.2020.00029","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00029","url":null,"abstract":"The increasing scale and complexity of today's high-performance computing (HPC) systems demand a renewed focus on enhancing the resilience of long-running scientific applications in the presence of faults. Many of these applications are iterative in nature as they operate on sparse matrices that concern the simulation of partial differential equations (PDEs) which numerically capture the physical properties on discretized spatial domains. While these applications currently benefit from many application-agnostic resilience techniques at the system level, such as checkpointing and replication, there is significant overhead in deploying these techniques. In this paper, we seek to develop application-aware resilience techniques that leverage an iterative application's intrinsic resiliency to faults and selectively protect certain elements, thereby reducing the resilience overhead. Specifically, we investigate the impact of soft errors on the widely used Preconditioned Conjugate Gradient (PCG) method, whose reliability depends heavily on the error propagation through the sparse matrix-vector multiplication (SpMV) operation. By characterizing the performance of PCG in correlation with a numerical property of the underlying sparse matrix, we propose a selective protection scheme that protects only certain critical elements of the operation based on an analytical model. An experimental evaluation using 20 sparse matrices from the SuiteSparse Matrix Collection shows that our proposed scheme is able to reduce the resilience overhead by as much as 70.2% and an average of 32.6% compared to the baseline techniques with full-protection or zero-protection.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"172 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121317829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Profile-Guided Optimization for Safe and Efficient Parallel Stream Processing in Rust","authors":"Stefan Sydow, Mohannad Nabelsee, S. Glesner, Paula Herber","doi":"10.1109/SBAC-PAD49847.2020.00047","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00047","url":null,"abstract":"The efficient mapping of stream processing applications to parallel hardware architectures is a difficult problem. While parallelization is often highly desirable as it reduces the overall execution time, its advantages must be carefully weighed against the parallelization overhead of complexity and communication costs. This paper presents a novel profile-guided optimization for parallel stream processing based on the multi-paradigm system programming language Rust. Our approach's key idea is to systematically balance the performance gain that can be achieved from parallelization with the communication overhead. To achieve this, we 1) use profiling to gain tight estimates of task execution times, 2) evaluate the cost of the fundamental concurrency constructs in Rust with synthetic benchmarks, and exploit this information to estimate the communication overhead introduced by various degrees of parallelism, and 3) present a novel optimization algorithm that exploits both estimates to fine-tune the degree of parallelism and train processing in a given application. Overall, our approach enables us to map parallel stream processing applications to parallel hardware efficiently. The safety concepts anchored in Rust ensure the reliability of the resulting implementation. We demonstrate our approach's practical applicability with two case studies: the word count problem and aircraft telemetry decoding.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126807653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On-chip Parallel Photonic Reservoir Computing using Multiple Delay Lines","authors":"S. Hasnain, R. Mahapatra","doi":"10.1109/SBAC-PAD49847.2020.00015","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00015","url":null,"abstract":"Silicon-Photonics architectures have enabled high speed hardware implementations of Reservoir Computing (RC). With a delayed feedback reservoir (DFR) model, only one non-linear node can be used to perform RC. However, the delay is often provided by using off-chip fiber optics which is not only space inconvenient but it also becomes architectural bottleneck and hinders to scalability. In this paper, we propose a completely on-chip photonic RC architecture for high performance computing, employing multiple electronically tunable delay lines and micro-ring resonator (MRR) switch for multi-tasking. Proposed architecture provides 84% less error compared to the state-of-the-art standalone architecture in [8] for executing NARMA task. For multi-tasking, the proposed architecture shows 80% better performance than [8]. The architecture outperforms all other proposed architectures as well. The on-chip area and power overhead of proposed architecture due to delay lines and MRR switch are 0.0184mm^2 and 26mW respectively.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114773316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Green Energy Consumption of Fog Computing Architectures","authors":"A. Gougeon, Benjamin Camus, Anne-Cécile Orgerie","doi":"10.1109/SBAC-PAD49847.2020.00021","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00021","url":null,"abstract":"The Cloud already represents an important part of the global energy consumption, and this consumption keeps increasing. Many solutions have been investigated to increase its energy efficiency and to reduce its environmental impact. However, with the introduction of new requirements, notably in terms of latency, an architecture complementary to the Cloud is emerging: the Fog. The Fog computing paradigm represents a distributed architecture closer to the end-user. Its necessity and feasibility keep being demonstrated in recent works. However, its impact on energy consumption is often neglected and the integration of renewable energy has not been considered yet. The goal of this work is to exhibit an energy-efficient Fog architecture considering the integration of renewable energy. We explore three resource allocation algorithms and three consolidation policies. Our simulation results, based on real traces, show that the intrinsic low computing capability of the nodes in a Fog context makes it harder to exploit renewable energy. In addition, the share of the consumption from the communication network between the computing resources increases in this context, and the communication devices are even harder to power through renewable sources.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127335772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fast and Concise Parallel Implementation of the 8x8 2D IDCT using Halide","authors":"Martin J. Johnson, D. Playne","doi":"10.1109/SBAC-PAD49847.2020.00032","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00032","url":null,"abstract":"The Inverse Discrete Cosine Transform (IDCT) is commonly used for image and video decoding. Due to the ubiquitous nature of this application area, very efficient implementations of the IDCT transform are of great importance and have lead to the development of highly optimized libraries. The popular libjpeg-turbo library contains 1000s of lines of handwritten assembly code utilizing SIMD instruction sets for a variety of architectures. We present an alternative approach, implementing the 8x8 2D IDCT written in the image processing language Halide - a high-level, functional language that allows for concise, portable, parallel and very efficient code. We show how less than 100 lines of Halide can replace over 1000 lines of code for each architecture in the libjpeg-turbo library to perform JPEG decoding. The Halide implementation is compared for ARMv8 and x86-64 SIMD extensions and shows a 5-25 percent performance improvement over the SIMD code in libjpeg-turbo while also being much easier to maintain and port to new architectures.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127850931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Sharing-Aware Thread Mapping in Software Transactional Memory","authors":"Douglas Pereira Pasqualin, M. Diener, A. R. D. Bois, M. Pilla","doi":"10.1109/SBAC-PAD49847.2020.00016","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00016","url":null,"abstract":"Software Transactional Memory (STM) is an alternative abstraction to synchronize processes in parallel programming. One advantage is simplicity since it is possible to replace the use of explicit locks with atomic blocks. Regarding STM performance, many studies already have been made focusing on reducing the number of aborts. However, in current multicore architectures with complex memory hierarchies, it is also important to consider where the memory of a program is allocated and how it is accessed. This paper proposes the use of a technique called sharing-aware mapping, which maps threads to cores of an application based on their memory access behavior, to achieve better performance in STM systems. We introduce STMap, an online, low overhead mechanism to detect the sharing behavior and perform the mapping directly inside the STM library, by tracking and analyzing how threads perform STM operations. In experiments with the STAMP benchmark suite and synthetic benchmarks, STMap shows performance gains of up to 77% on a Xeon system (17.5% on average) and 85% on an Opteron system (9.1% on average), compared to the Linux scheduler.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127867442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}