{"title":"Towards Communication Profile, Topology and Node Failure Aware Process Placement","authors":"Ioannis Vardas, Manolis Ploumidis, M. Marazakis","doi":"10.1109/SBAC-PAD49847.2020.00041","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00041","url":null,"abstract":"HPC systems need to keep growing in size to meet the ever-increasing demand for high levels of capability and capacity, often in tight time windows for urgent computation. However, increasing the size, complexity and heterogeneity of HPC systems also increases the risk and impact of system failures, that result in resource waste and aborted jobs. A major contributor to job completion time is the cost of interprocess communication. To address performance and energy efficiency, several prior studies have targeted improvements of communication locality. To meet this goal, they derive a mapping of MPI processes to system nodes in a way that reduces communication cost. However, such approaches disregard the effect of system failures. In this work, we propose a resource allocation approach for MPI jobs, considering both high performance and error resilience. Our approach, named Communication Profile, Topology and node Failure (CPTF), takes into account the application's communication profile, system topology and node failure probability for assigning job processes to nodes. We evaluate variants of CPTF through simulations of two MPI applications, one with a regular communication pattern (LAMMPS) and one with an irregular one (NPB-DT). In both cases, the variant of CPTF that strives to avoid failure-prone nodes and communication paths achieves lower time to complete job batches when compared to the default resource allocation policy of Slurm. It also exhibits the lowest ratio of aborted jobs. The average improvement in batch completion time is 67% for NPB-DT and 34% for LAMMPS.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127903331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance and Portable Convolution Operators for Multicore Processors","authors":"Pablo San Juan, Adrián Castelló, M. F. Dolz, P. Alonso-Jordá, E. S. Quintana‐Ortí","doi":"10.1109/SBAC-PAD49847.2020.00023","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00023","url":null,"abstract":"The considerable impact of Convolutional Neural Networks on many Artificial Intelligence tasks has led to the development of various high performance algorithms for the convolution operator present in this type of networks. One of these approaches leverages the IM2COL transform followed by a general matrix multiplication (GEMM) in order to take advantage of the highly optimized realizations of the GEMM kernel in many linear algebra libraries. The main problems of this approach are 1) the large memory workspace required to host the intermediate matrices generated by the IM2COL transform; and 2) the time to perform the IM2COL transform, which is not negligible for complex neural networks. This paper presents a portable high performance convolution algorithm based on the BLIS realization of the GEMM kernel that avoids the use of the intermediate memory by taking advantage of the BLIS structure. In addition, the proposed algorithm eliminates the cost of the explicit IM2COL transform, while maintaining the portability and performance of the underlying realization of GEMM in BLIS.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"2002 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131372439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Controlling Garbage Collection and Request Admission to Improve Performance of FaaS Applications","authors":"David Quaresma, Daniel Fireman, T. Pereira","doi":"10.1109/SBAC-PAD49847.2020.00033","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00033","url":null,"abstract":"Runtime environments like Java's JRE, .NET's CLR, and Ruby's MRI, are popular choices for cloud-based applications and particularly in the Function as a Service (FaaS) serverless computing context. A critical component of runtime environments of these languages is the garbage collector (GC). The GC frees developers from manual memory management, which could potentially ease development and avoid bugs. The benefits of using the GC come with a negative impact on performance; that impact happens because either the GC needs to pause the runtime execution or competes with the running program for computational resources. In this work, we evaluated the usage of a technique - Garbage Collector Control Interceptor (GCI) - that eliminates the negative impact of GC on performance by controlling GC executions and transparently shedding requests while the collections are happening. We executed experiments simulating AWS Lambda's behavior and found that GCI is a viable solution. It benefited the user by improving the response time up to 10.86% at 99.9th percentile and reducing cost by 7.22%, but it also helped the platform provider by improving resource utilization by 14.52%.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127683345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware Multiversioning for Fail-Operational Multithreaded Applications","authors":"Rico Amslinger, Christian Piatka, Florian Haas, Sebastian Weis, T. Ungerer, S. Altmeyer","doi":"10.1109/SBAC-PAD49847.2020.00014","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00014","url":null,"abstract":"Modern safety-critical embedded applications like autonomous driving need to be fail-operational. At the same time, high performance and low power consumption are demanded. A common way to achieve this is the use of heterogeneous multi-cores. When applied to such systems, prevalent fault tolerance mechanisms suffer from some disadvantages: Some (e.g. triple modular redundancy) require a substantial amount of duplication, resulting in high hardware costs and power consumption. Others (e.g. lockstep) require supplementary checkpointing mechanisms to recover from errors. Further approaches (e.g. software-based process-level redundancy) cannot handle the indeterminism introduced by multithreaded execution. This paper presents a novel approach for fail-operational systems using hardware transactional memory, which can also be used for embedded systems running heterogeneous multi-cores. Each thread is automatically split into transactions, which then execute redundantly. The hardware transactional memory is extended to support multiple versions, which allows the reproduction of atomic operations and recovery in case of an error. In our FPGA-based evaluation, we executed the PARSEC benchmark suite with fault tolerance on 12 cores.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124541331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimized Transactional Data Structure Approach to Concurrency Control for In-Memory Databases","authors":"Christina L. Peterson, Amalee Wilson, P. Pirkelbauer, D. Dechev","doi":"10.1109/SBAC-PAD49847.2020.00025","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00025","url":null,"abstract":"The optimistic concurrency control (OCC) utilized by in-memory databases performs writes on thread-local copies and makes the writes visible upon passing validation. However, high contention workloads suffer from failure of the validation step due to non-semantic memory access conflicts, leading to frequent transaction aborts. In this work, we improve the commit rate of in-memory databases by replacing OCC and the underlying indexing of key-value entries in the Silo database with a lock-free transactional dictionary. To further optimize the transactional commit rate, we present transactional merging, a technique that relaxes the semantic conflict resolution of transactional data structures by merging conflicting operations to reduce aborts. Transactional merging guarantees strict serializability through a strategy that recovers the correct abstract state given that a transaction attempting to merge operations aborts. The experimental evaluation demonstrates that the lock-free transactional dictionary with transactional merging achieves an average speedup of 175% over OCC and the Masstree indexing used in the Silo database for write-dominated workloads on a non-uniform memory access system.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122872461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Re-evaluation of Atomic Operations and Graph Coloring for Unstructured Finite Volume GPU Simulations","authors":"Xi Zhang, Xu Sun, Xiaohu Guo, Yunfei Du, Yutong Lu, Yang Liu","doi":"10.1109/SBAC-PAD49847.2020.00048","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00048","url":null,"abstract":"In general, race condition can be resolved by introducing synchronisations or breaking data dependencies. Atomic operations and graph coloring are the two typical approaches to avoid race condition. Graph coloring algorithms have been generally considered winning algorithms in the literature due to their lock free implementations. In this paper, we present the GPU-accelerated algorithms of the unstructured cell-centered finite volume Computational Fluid Dynamics (CFD) software framework named PHengLEI which was originally developed for aerodynamics applications with arbitrary hybrid meshes. Overall, the newly developed GPU framework demonstrate up to 4.8 speedup comparing with 18 MPI tasks run on the latest Intel CPU node. Furthermore, the enormous efforts have been invested to optimize data dependencies which could lead to race condition due to unstructured mesh indirect addressing and related reduction math operations. With careful comparison between our optimised graph coloring and atomic operations using a series of numerical tests with different mesh sizes, the results show that atomic operations are more efficient than our optimised graph coloring in all of the test cases on Nvidia Tesla GPU V100. Specifically, for the summation operation, using atomicAdd is twice as fast as graph coloring. For the maximum operation, a speedup of 1.5 to 2 is found for atomicMax vs. graph coloring.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131709108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extending Heterogeneous Applications to Remote Co-processors with rOpenCL","authors":"Rui Alves, J. Rufino","doi":"10.1109/SBAC-PAD49847.2020.00049","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00049","url":null,"abstract":"In heterogeneous computing systems, general purpose CPUs are coupled with co-processors of different architectures, like GPUs and FPGAs. Applications may take advantage of this heterogeneous device ensemble to accelerate execution. However, developing heterogeneous applications requires specific programming models, under which applications unfold into code components targeting different computing devices. OpenCL is one of the main programming models for heterogeneous applications, set apart from others due to its openness, vendor independence and support for different co-processors. In the original OpenCL application model, a heterogeneous application starts in a certain host node, and then resorts to the local co-processors attached to that host. Therefore, co-processors at other nodes, networked with the host node, are inaccessible and cannot be used to accelerate the application. rOpenCL (remote OpenCL) overcomes this limitation for a significant set of the OpenCL 1.2 API, offering OpenCL applications transparent access to remote devices through a TPC/IP based network. This paper presents the architecture and the most relevant implementation details of rOpenCL, together with the results of a preliminary set of reference benchmarks. These prove the stability of the current prototype and show that, in many scenarios, the network overhead is smaller than expected.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134321468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Highly Efficient SGEMM Implementation using DMA on the Intel/Movidius Myriad-2","authors":"Suyash Bakshi, L. Johnsson","doi":"10.1109/SBAC-PAD49847.2020.00051","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00051","url":null,"abstract":"Reducing energy consumption and achieving high energy efficiency in computation has become the top priority in High Performance Computing. High energy efficiency generally requires high resource utilization since energy demand for any applications and architectures is dependent on active time. We show that by using DMA the 28nm CMOS node Myriad-2 Vision Processing Unit can achieve 25 GFLOPs/W for FP32 matrixmultiplication. Our main contributions are: (i) An analysis of data transfer needs for inner and outer-product formulations of matrix multiplication with respect to the Myriad-2 memory hierarchy, (ii) An efficient use of DMA for managing matrix block transfers between on-chip and main memory (iii) A detailed analysis of the effects of matrix block shapes and DRAM page faults on performance and energy efficiency.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"235 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132296683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"JAMPI: A C++ Parallel Programming Interface Allowing the Implementation of Custom and Generic Scheduling Mechanisms","authors":"D. D. Domenico, G. H. Cavalheiro","doi":"10.1109/SBAC-PAD49847.2020.00045","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00045","url":null,"abstract":"The widespread of modern parallel architectures brought many challenges in terms of programming. In response, many parallel programming tools intend to aid the user in order to exploit hardware resources effectively. As a new alternative, this paper introduces the design and implementation of JAMPI, a generic parallel programming interface developed in C++ focused on code reuse, productivity and high-level abstraction to enable the construction of parallel applications. JAMPI is totally integrated with its host programming language and offers, as main feature, a fully disassociation of its programming interface from its scheduling mechanism. Aiming to manage and optimize the parallel execution, the proposed interface allows the programmer to implement a custom scheduling heuristic for each portion of the application code. Besides JAMPI model description, we proceeded some preliminary experiments using applications encoded with the proposed framework. Results showed that JAMPI can be used to reach performance on multicore platforms and does not add performance penalty over the sequential version of the benchmarks.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131772770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating Computation and Data Placements in Edge Infrastructures through a Common Simulator","authors":"A. Silva, Clément Mommessin, P. Neyron, D. Trystram, Adwait Bauskar, A. Lèbre, Alexandre van Kempen, Yanik Ngoko, Yoann Ricordel","doi":"10.1109/SBAC-PAD49847.2020.00020","DOIUrl":"https://doi.org/10.1109/SBAC-PAD49847.2020.00020","url":null,"abstract":"Scheduling computational jobs with data-sets dependencies is an important challenge of edge computing infrastructures. Although several strategies have been proposed, they have been evaluated through ad-hoc simulator extensions that are, when available, usually not maintained. This is a critical problem because it prevents researchers to –easily– perform fair comparisons between different proposals. In this paper, we propose to address this limitation by presenting a simulation engine dedicated to the evaluation and comparison of scheduling and data movement policies for edge computing use-cases. Built upon the Batsim/SimGrid toolkit, our tool includes an injector that allows the simulator to replay a series of events captured in real infrastructures. It also includes a controller that supervises storage entities and data transfers during the simulation, and a plug-in system that allows researchers to add new models to cope with the diversity of edge computing devices. We demonstrate the relevance of such a simulation toolkit by studying two scheduling strategies with four data movement policies on top of a simulated version of the Qarnot Computing platform, a production edge infrastructure based on smart heaters. We chose this use-case as it illustrates the heterogeneity as well as the uncertainties of edge infrastructures. Our ultimate goal is to gather industry and academics around a common simulator so that efforts made by one group can be factorised by others.","PeriodicalId":202581,"journal":{"name":"2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133012236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}